System and process for validating, aligning and reordering one or more genetic sequence maps using at least one ordered restriction map

ABSTRACT

A method and system are provided for comparing ordered segments of a first DNA restriction map with ordered segments of a second DNA restriction map to determine a level of accuracy of the first DNA map and/or the second DNA map. In particular, the first and second DNA maps can be received (the first DNA map corresponding to a sequence DNA map, and the second DNA map corresponding to a genomic consensus DNA map as provided in an optical DNA map). Then, the accuracy of the first DNA map and/or the second DNA map is validated based on information associated with the first and second DNA maps. In addition, a method and system are provided for aligning a plurality of DNA sequences with an ordered DNA restriction map. The DNA sequences and the DNA map are received (the DNA sequences being fragments of a genome and the DNA map corresponding to a genomic consensus DNA map which relates to an optical ordered DNA map). Then, a level of accuracy of the DNA sequences and the DNA map is obtained based on information associated with the DNA sequences and the DNA map by means of the method and system described above. The locations of the DNA map at which the DNA sequences are capable of being associated with particular segments of the DNA map are located. Furthermore, it is possible to obtain locations of the DNA map (without the validation) by locating an optimal one of the locations for each of the DNA sequences for each of the locations.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a national stage application of PCT Application No. PCT/US01/30426 which was filed on Sep. 28, 2001 and published on Apr. 4, 2002 as International Publication No. WO 02/26934 (the “International Application”). This application claims priority from the International Application pursuant to 35 U.S.C. §365. The present application also claims priority under 35 U.S.C. §119 from U.S. Patent Application Ser. Nos. 60/236,296 and 60/293,254, filed on Sep. 28, 2000 and May 24, 2001, respectively. The entire disclosures of these applications are incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates to a system and process for a sequence validation based on at least one ordered restriction map, and more particularly to validating, aligning and/or reordering one or more genetic sequence maps (e.g., ordered restriction enzyme DNA maps) using such ordered restriction map via map matching and comparison.

BACKGROUND INFORMATION

The sequence of nucleotide bases present in strands of nucleotides, such as DNA and RNA, carries the genetic information encoding proteins and RNAs. The ability to accurately determine a nucleotide sequence is crucial to many areas in molecular biology. For example, the study of genetics relies on complete nucleotide sequences of the organism. Many efforts have been made to generate complete nucleotide sequences for various organisms, including humans, mice, worms, flies and microbes.

There are a variety of well-known methods to sequence nucleotides, including the Sanger dideoxy chain termination sequencing technique and the Maxam-Gilbert chemical sequencing technique. However, the current technology limits the length of a nucleotide sequence that may be sequenced. Techniques have been developed to sequence larger nucleotide sequences. In general, these methods involve fragmenting the large sequence into fragments, cloning the fragments, and sequencing the cloned fragments. The sequences can be fragmented through the use of restriction enzymes or mechanical shearing. Cloning techniques include the use of cloning vectors such as cosmids, bacteriophage, and yeast or bacterial artificial chromosomes (YAC or BAC). The nucleotide sequence of the fragments can then be compared, overlapping regions identified, and the sequences assembled to form “contigs,” which are sets of overlapping clones. By assembling the overlapping clones, it is possible to determine the sequence of nucleotide bases of the full length sequence. These methods are well known to those having ordinary skill in the art.

The accuracy of nucleotide sequence data is limited by numerous factors. For example, there may be missing sections due to incomplete representation of the genomic DNA. There may also be spurious DNA sequences intermixed with the desired genomic DNA. Common sources of contamination are vector-derived DNA and host cell DNA. Also, the accuracy of the identification of bases tends to degrade toward the end of long sequence reads. Additionally, repeated sequences can create errors in the re-assembly and/or the mismatching of contigs.

In order to reduce the sequence data errors, sequencing of the fragments is generally performed multiple times. To help reduce errors such as mismatching or misassembly resulting from repeated sequences, the “hierarchical shotgun sequencing” approach (also referred to as “map-based,” “BAC-based” or “clone by clone”) can be used. This approach involves generating and organizing a set of large insert clones covering the genome and separately performing shotgun sequencing on appropriately selected clones. Because the sequence information is local, the issue of long-range misassembly is eliminated and the risk of short-range misassembly is reduced.

Other known sequencing and characterization techniques involve generating restriction fragment fingerprints to determine whether close overlaps are present, thereby assembling the BACs into fingerprint clone contigs. Fingerprint clone contigs can be positioned along the chromosome by anchoring them with sequence-tagged site (STS) markers from existing genetic and physical maps. These fingerprint clone contigs can be associated with specific STSs by probe hybridization or direct search of the sequenced clones. Clones can also be positioned by fluorescence in situ hybridization. Each of these known techniques is costly and time consuming.

Another approach for characterizing nucleotide sequences involves the use of ordered restriction maps of single molecules. One specific technique used to produce single molecule ordered restriction maps is “Optical Mapping”. Optical mapping is a single molecule methodology for the rapid production of ordered restriction maps from individual DNA molecules. Ordered restriction maps are preferably constructed using fluorescence microscopy to visualize restriction endonuclease cutting events on individual fluorochrome-stained DNA molecules. Restriction enzyme cleavage sites are visible as gaps that appear flanking the relaxed DNA fragments (pieces of molecules between two consecutive cleavages). Relative fluorescence intensity (measuring the amount of fluorochrome binding to the restriction fragment) or apparent length measurements (along a well-defined “backbone” spanning the restriction fragment) have proven to provide accurate size-estimates of the restriction fragment and have been used to construct the final restriction map.

Such a restriction map created from one individual DNA molecule is limited in its accuracy by the resolution of the microscopy, the imaging system (CCD camera, quantization level, etc.), illumination and surface conditions. Furthermore, depending on the digestion rate and the noise inherent to the intensity distribution along the DNA molecule, with some probability, one is likely to miss a small fraction of the restriction sites or introduce spurious sites. Additionally, investigators may sometimes (rather infrequently) lack the exact orientation information (whether the left-most restriction site is the first or the last). Thus, given two arbitrary single molecule restriction maps for the same DNA clone obtained this way, the maps are expected to be roughly the same in the following sense: if the maps are “aligned” by first choosing the orientation and then identifying the restriction sites that differ by a small amount, then most of the restriction sites will appear roughly at the same place in both the maps.

For instance, in the original method, fluorescently-labeled DNA molecules were elongated in a flow of molten agarose containing restriction endonucleases, generated between a cover-slip and a microscope slide, and the resulting cleavage events were recorded by fluorescence microscopy as time-lapse digitized images. The second generation optical mapping approach, which dispensed with agarose and time-lapsed imaging, involves fixing elongated DNA molecules onto positively-charged glass surfaces, thus improving sizing precision as well as throughput for a wide range of cloning vectors (cosmid, bacteriophage, and yeast or bacterial artificial chromosomes (YAC or BAC)).

A DNA sequence map is an “in silico” ordered restriction map that is obtained for a nucleotide sequence by simulating a restriction enzyme digestion process. The sequence data is analyzed and restriction sites are identified in a predetermined manner. The resulting sequence map has some piece of identification data plus a vector of fragments, whose elements encode the size in base-pairs.
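As an illustration only, the simulated digestion described above can be sketched in a few lines of Python. The function name `in_silico_digest`, the `cut_offset` parameter and the dictionary layout of the resulting map are assumptions made for this sketch and are not taken from the disclosure; the example site GAATTC with a cut after the first base corresponds to the well-known enzyme EcoRI.

```python
import re

def in_silico_digest(sequence, site, cut_offset=0):
    """Return the ordered fragment sizes (in base-pairs) obtained by cutting
    `sequence` at every occurrence of the recognition `site`.

    `cut_offset` is the distance from the start of the site to the cut
    (a hypothetical parameter for this sketch).
    """
    seq = sequence.upper()
    pattern = "(?=" + re.escape(site.upper()) + ")"  # lookahead finds overlapping sites too
    cuts = {m.start() + cut_offset for m in re.finditer(pattern, seq)}
    # Only cuts strictly inside the molecule create new fragments.
    interior = sorted(c for c in cuts if 0 < c < len(seq))
    bounds = [0] + interior + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]

# A sequence map: identification data plus a vector of fragment sizes.
sequence_map = {
    "id": "example_sequence",  # placeholder identifier
    "fragments": in_silico_digest("AAGAATTCGGGGAATTCTT", "GAATTC", cut_offset=1),
}
```

The fragment sizes always sum to the length of the input sequence, which gives a quick sanity check on any simulated map.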

Sequenced clones can be associated with fingerprint clone contigs in the physical map by using the sequence data to calculate a partial list of restriction fragments in silico and comparing that list with the experimental database of BAC fingerprints. Genomic consensus maps are generated from optical maps using, e.g., “Gentig” software which is a conventional software that generates optical ordered restriction maps.

It was previously unknown how to determine the accuracy of the DNA sequence maps. Indeed such determination was either impossible or provided a small level of surety. It is one of the objects of the present invention to enable a validation of the DNA ordered sequence maps against the optical maps. Another object of the present invention is to enable an alignment and reordering of the DNA sequence maps based on the optical mapping.

Approaches to aligning or reconstructing restriction maps have been described in E. W. Myers et al., “An O(N² lg N) Restriction Map Comparison and Search Algorithm”, Bulletin of Mathematical Biology, 54(4):599-618, 1992; R. M. Karp et al., “Algorithms for Optical Mapping”, RECOMB 98, 1998; Parida, L., A Uniform Framework for Ordered Restriction Map Problems, Journal of Computational Biology, Vol 5, No 4, Mary Ann Liebert Inc. Publishers, pp 725-739, 1998; Gusfield, D., Algorithms on Strings, Trees, and Sequences, Cambridge University Press, 1997; and Lee, J. K., Dancik, V., and M. S. Waterman, “Estimation for restriction sites observed by optical mapping using reversible-jump Markov Chain Monte Carlo”, J. Comp. Biol., 5, 505-516, 1997. However, none of these publications disclose the novel processes and systems described herein below.

SUMMARY OF THE INVENTION

In general, an exemplary embodiment of the system and process for validating and aligning the simulated ordered restriction map against the optical ordered restriction map according to the present invention can be implemented as follows. First, each molecule may be cut in several places using a digestion process by one or more restriction enzymes as is known to those having ordinary skill in the art. Each of these “cut” molecules can represent a partial DNA (optical) ordered restriction map. Then, it is possible to reconstruct a complete Genome Wide (optical) ordered restriction map. Such reconstruction process can be carried out by an iterative process which maximizes the likelihood of a plausible hypothesis given the partial map and the model of the error sources (e.g., a Bayesian-based process).

It should be understood that the inputs to the Validation/Alignment system and process are preferably restriction maps (which include DNA sequences therein) and Genome wide (e.g. optical) ordered restriction maps (which can be represented as variable length vectors of segment/fragment information fields). Each segment information field has two pieces of information associated therewith: size and standard deviation. The size may be a measure of the segment, which is proportional to the number of nucleotides present in the segment. The standard deviation preferably represents the error associated with the segment size measurement. Each map has associated therewith, e.g., two measures of how reliable the detection of cuts by the procedure is, i.e., the false positive probability and the digestion probability. The first measure relates to the event that the cut is detected incorrectly. The second measure relates to the event that the cut actually appears where it is reported.

According to the present invention, the optical and simulated ordered restriction maps are compared to one another to determine whether and to what extent they match. The accuracy of a match is computed by minimizing the error committed by matching one map against the other at a given position. An exemplary mathematical model and procedure underlying this computation is preferably a Bayesian-based procedure/algorithm. The computation is based on a Dynamic Programming Procedure (“DPP”). However, it should be understood that other procedures and algorithms can be utilized to compare these maps to one another to validate and align at least one such map, according to the present invention.

Using the Bayesian-based exemplary procedure with the system and method of the present invention, a hypothesis can be obtained and the probability of a given event (based on the hypothesis) may be formulated. This probability is preferably a mathematical formula, which is then computed using a conventional model of various error sources. An exemplary optimization process which uses such formula may maximize or minimize the formula.

In order to find the extreme value of the overall probability formula over all possible combinations of matches, a conventional DPP can be used on the problem which was defined by the Bayesian-based exemplary procedure as described above. For example, the DPP may preferably compute a set of extreme values for a mathematical formula defined above by extending a partial solution in a predetermined manner while keeping track of a particular number of alternatives. All of the alternatives may be maintained in a table, and thus do not have to be recomputed every time the associated likelihood or score function needs to be evaluated.

Accordingly, a method and system according to the present invention are provided for comparing ordered segments of a first DNA map with ordered segments of a second DNA map to determine a level of accuracy of the first DNA map and/or the second DNA map. In particular, the first and second DNA maps can be received (the first DNA map corresponding to a sequence DNA map, and the second DNA map corresponding to a genomic consensus DNA map as provided in an optical DNA map). Then, the accuracy of the first DNA map and/or the second DNA map is validated based on information associated with the first and second DNA maps.

In another embodiment of the present invention, the first DNA map and/or the second DNA map are validated by determining whether one or more matches exist between ordered segments of the first DNA map and the ordered segments of the second DNA map. In addition, a number of the matches which exist between the segments of the first DNA map and the segments of the second DNA map can be obtained.

In yet another embodiment of the present invention, the first DNA map and/or the second DNA map are validated by determining whether the first DNA map includes one or more cuts which are missing from the second DNA map. Also, a number and locations of the missing cuts based on the first and second DNA maps can be obtained thereafter.

According to a further embodiment of the present invention, the first DNA map and/or the second DNA map are validated by determining whether the second DNA map includes one or more cuts which are absent from the first DNA map. The validation can also be performed by determining whether the first DNA map includes one or more cuts which are missing from the second DNA map, obtaining a first number and locations of the missing cuts based on the first and second DNA maps, determining whether the second DNA map includes one or more cuts which are absent from the first DNA map, and obtaining a second number and locations of the absent cuts based on the first and second DNA maps. Furthermore, it is possible to generate an error indication if the number of the matches is less than a match threshold, the first number of the missing cuts is greater than a first predetermined threshold, and/or the second number of the absent cuts is greater than a second predetermined threshold.
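The threshold test described in this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions: the class `ValidationCounts`, the function `error_indications` and the message strings are hypothetical names chosen for the sketch, and the thresholds are supplied by the caller rather than prescribed by the disclosure.

```python
from dataclasses import dataclass

@dataclass
class ValidationCounts:
    matches: int        # matched segments between the two maps
    missing_cuts: int   # cuts in the first map that are missing from the second map
    absent_cuts: int    # cuts in the second map that are absent from the first map

def error_indications(counts, match_threshold, missing_threshold, absent_threshold):
    """Return a list of error messages; an empty list means no error is indicated."""
    errors = []
    if counts.matches < match_threshold:
        errors.append("too few matches")
    if counts.missing_cuts > missing_threshold:
        errors.append("too many missing cuts")
    if counts.absent_cuts > absent_threshold:
        errors.append("too many absent cuts")
    return errors
```

For example, `error_indications(ValidationCounts(12, 1, 4), 10, 2, 3)` would flag only the absent cuts, since the match count and the missing-cut count are within their thresholds.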

In another embodiment of the present invention, the first DNA map is an in-silico ordered restriction map obtained from a DNA sequence, which may include identification data and at least one vector of the segments of the first DNA map. At least one vector of the first segments can encode a size in base-pairs of the DNA sequence. Further, the second DNA map can include identification data and at least one variable-length vector representing its ordered segments.

In still another embodiment of the present invention, the second DNA map is defined as a subsequence of a genome-wide ordered restriction map. Also, the validation is performed by determining the accuracy of at least one of the first DNA map and the second DNA map using the following probability density function:

$$\Pr\bigl(D \mid \hat{H}(\sigma, p_c, p_f)\bigr)$$

where $D$ is the second DNA map, $\hat{H}$ is the first DNA map, $\sigma$ is a standard deviation summarizing map-wide standard deviation data, $p_c$ is a probability of a positive cut of a DNA sequence, and $p_f$ is a probability of a false-positive cut of the DNA sequence.

In another embodiment of the present invention, the accuracy can bevalidated as a function of an orientation of the first DNA map withrespect to an orientation of the second DNA map. Also, the validationcan be performed by executing a dynamic programming procedure (“DPP”) onthe first and second DNA maps to generate a first table of partial andcomplete alignment scores, and first auxiliary tables and first datastructures to keep track of number and locations of cuts and segmentmatches, receiving a third DNA map which is a reverse map of the firstDNA map, executing the DPP on the second and third DNA maps to generatea second table of partial and complete alignment scores, and secondauxiliary tables and second data structures to keep track of number andlocations of the cuts and the segment matches, analyzing a last row ofthe first table and a last row of the second table to obtain at leastone optimum alignment of the first and second DNA maps, andreconstructing an optimum alignment and/or sub-optimal alignments usingthe first and second auxiliary tables and data structures.

According to still another embodiment of the present invention, the accuracy can be validated by matching an extension of one or more left end segments of the segments of the first DNA map to at least one segment of the second DNA map and/or by matching an extension of one or more right end segments of the segments of the first DNA map to at least one segment of the second DNA map. Furthermore, it is possible to detect an alignment of the first DNA map with respect to the second DNA map, the alignment being indicative of sequence positions of the segments of the first DNA map along the second DNA map.

In addition, other embodiments of the process and system according to the present invention are provided for aligning a plurality of DNA sequences with a DNA map. First, the DNA sequences and the DNA map can be received (the DNA sequences being fragments of a genome and the DNA map corresponding to a genomic consensus DNA map which relates to an ordered restriction (e.g., optical) DNA map). Then, a level of accuracy of the DNA sequences and the DNA map is validated based on information associated with the DNA sequences and the DNA map. The locations of the DNA map at which the DNA sequences are capable of being associated with particular segments of the DNA map are located. Furthermore, it is possible to obtain locations of the DNA map (without the validation) by locating an optimal one of the locations for each of the DNA sequences for each of the locations.

In another embodiment of the present invention, the locations are determined for each of the DNA sequences. The locations may be positions on the DNA map at which the corresponding DNA sequences are anchorable, and these locations can define at least one alignment of the DNA sequences with respect to the DNA map. The alignment may include multiple alignments of the DNA sequences with respect to the DNA map, and the multiple alignments may be ranked based on predetermined criteria to obtain a score set which includes a particular score for each of the multiple alignments. The determination may be performed by providing the DNA sequences in a first order of the multiple alignments with respect to the DNA map and determining a position for each of the DNA sequences, with respect to the DNA map, by selecting the DNA sequences to be in a second order corresponding to the score set.

In still another embodiment of the present invention, the determination of the locations can be performed by restricting each of the DNA sequences to be associated with only one of the locations on the DNA map. Also, such determination may produce a single alignment of the DNA sequences with respect to the DNA map.

In yet another embodiment of the present invention, the determination can be performed by locating an optimal one of the locations for each of the DNA sequences to obtain an alignment solution for each of the locations. Also, the locating of the optimal location may be repeated for each subsequent one of the locations and excluding the alignment solution from a preceding locating procedure. Furthermore, each subsequent locating procedure can be made by relaxing at least one particular constraint to determine the respective locations. The particular constraint preferably includes a first requirement that two of the DNA sequences are prevented from overlapping when associated with the respective locations on the DNA map. The particular constraint can include a second requirement that a maximum number of the DNA sequences are associated with the respective locations on the DNA map, and a third requirement that an overall score of the alignment of the DNA sequences with respect to the locations on the DNA map is minimized or maximized. It is also possible to assign respective weights to the second requirement and the third requirement.
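One simple way to honor the non-overlap requirement while placing each DNA sequence at most once is a greedy pass over the candidate locations in order of score. The sketch below is a heuristic offered for illustration only and is not asserted to be the optimization used by the invention; the function name `place_sequences`, the tuple layout and the convention that a lower score is better are all assumptions.

```python
def place_sequences(candidates):
    """Greedy placement of sequences on the map.

    `candidates` is a list of (seq_id, start, end, score) tuples, where
    [start, end) is a half-open interval on the map and a lower score is better.
    Each sequence is placed at most once and placed intervals never overlap.
    """
    chosen = {}    # seq_id -> (start, end, score)
    occupied = []  # intervals already taken on the map
    for seq_id, start, end, score in sorted(candidates, key=lambda c: c[3]):
        if seq_id in chosen:
            continue  # this sequence already has a better-scoring location
        if any(start < e and s < end for s, e in occupied):
            continue  # would overlap a previously placed sequence
        chosen[seq_id] = (start, end, score)
        occupied.append((start, end))
    return chosen
```

Given candidates `("A", 0, 5, 0.1)`, `("B", 3, 8, 0.2)`, `("B", 5, 9, 0.5)` and `("A", 10, 14, 0.9)`, sequence A takes its best location and sequence B falls back to its second-best one, since its best location overlaps A. A greedy pass does not guarantee the maximum number of placed sequences; a weighted trade-off between the number of placements and the overall score would require a global optimization.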

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention and its advantages, reference is now made to the following description, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a first exemplary embodiment of a system for validating, aligning and/or reordering a genetic sequence using an optical map via map matching and comparison according to the present invention;

FIG. 2 is a second exemplary embodiment of a system for validating, aligning and/or reordering a genetic sequence using the optical map;

FIG. 3 is an exemplary embodiment of a validation procedure of a process according to the present invention;

FIG. 4 is an exemplary embodiment of the process according to the present invention for simulating a restriction digestion of the sequence map, and then validating the accuracy of the consensus optical ordered restriction map and/or the simulated map;

FIG. 5A is a detailed flow chart of an exemplary validation technique utilized in the process shown in FIG. 4;

FIG. 5B is a detailed illustration of an exemplary flow diagram of particular steps of FIG. 5A in which fragments of the optical ordered restriction map are compared to fragments of the simulated ordered restriction map to obtain one or more set(s) of most likely matches;

FIG. 6A is a first exemplary illustration of a technique for matching a sequence map against a consensus optical map;

FIG. 6B is a second exemplary illustration of the technique for matching the sequence map against the consensus optical map in which the consensus optical map does not possess any false enzyme cuts and the sequence map does not have any missing enzyme cut(s);

FIG. 6C is a third exemplary illustration of the technique for matching the sequence map against the consensus optical map in which the consensus optical map does not possess any false enzyme cuts while the sequence map is missing the enzyme cut(s);

FIG. 6D is a fourth exemplary illustration of the technique for matching the sequence map against the consensus optical map in which the consensus optical map has a false enzyme cut and the sequence map does not have any missing enzyme cuts;

FIG. 6E is a fifth exemplary illustration of the technique for matching the sequence map against the consensus optical map in which the consensus optical map has a false enzyme cut and the sequence map is missing the enzyme cut;

FIG. 6F is a sixth exemplary illustration of the technique for matching the sequence map against the consensus optical map in which left fragments of each of the consensus optical and sequence maps are mismatched;

FIG. 6G is a seventh exemplary illustration of the technique for matching the sequence map against the consensus optical map in which right fragments of each of the consensus optical and sequence maps are mismatched;

FIG. 7 is a detailed illustration of the exemplary flow diagram of the validation procedure according to the present invention which utilizes dynamic programming principles and the sequence and consensus maps illustrated in FIGS. 6F and 6G;

FIG. 8 is an exemplary embodiment of the process according to the present invention in which an alignment of a simulated ordered restriction map takes place after (or during) the validation technique has been implemented to determine the accuracy of the simulated ordered restriction map(s) and/or the consensus optical map(s);

FIG. 9 is a detailed illustration of the flow diagram of the process shown in FIG. 8;

FIG. 10 is a flow diagram of a particular set of steps in the process illustrated in FIG. 9 in which best matches are selected for each sequence map and an overall alignment thereof is constructed; and

FIG. 11 is an illustration of an example of a possible alignment of a chromosome arrangement using the system and process of the present invention.

DETAILED DESCRIPTION

FIG. 1 illustrates a first exemplary embodiment of a system for validating, aligning and/or reordering a genetic sequence using an optical (consensus) map via map matching and comparison according to the present invention. In this embodiment, the system includes a processing device 10 which is connected to a communications network 100 (e.g., the Internet) so that it can receive optical sequence mapping data and DNA sequence data. The processing device 10 can be a mini-computer (e.g., “HEWLETT PACKARD”-brand mini computer), a personal computer (e.g., a “PENTIUM”-brand chip-based computer), a mainframe (e.g., “IBM”-brand 3090 system), and the like. The DNA sequence data can be provided from a number of sources. For example, this data can be “GENBANK”-brand Data 110 obtained from GenBank database (NIH genetic sequence database), Sanger Data 120 obtained from Sanger Center database, and/or “CELERA”-brand Data 130 obtained from the Celera Genomics database. These are publicly available genetic databases, or, in the last case, private commercial genetic databases. “Hewlett Packard” is a registered trade-mark of Hewlett-Packard Corporation (Palo Alto, Calif., USA), “Pentium” is a registered trade-mark of Intel Corporation (Santa Clara, Calif., USA), “IBM” is a registered trade-mark of International Business Machines Corporation (Armonk, N.Y., USA), “Celera” is a registered trade-mark of Celera Corporation (Alameda, California, USA). “GENBANK” is a registered trademark of the US Department of Health and Human Services (Bethesda, Md., USA). The optical sequence mapping data correspond to optical mapping data 140 that can be obtained from external systems. For example, such optical map data, i.e., optical mapping ordered restriction data, can be generated using the methods described in U.S. Pat. No. 6,174,671. In particular, the methods described in this U.S. patent produce high-resolution, high accuracy ordered restriction maps based on data created from images of populations of individual DNA molecules digested by restriction enzymes.

As shown in FIG. 1, after the processing device 10 receives the optical mapping data and the DNA sequence data via the communications network 100, it can then generate one or more results 20 which can be a validation/determination of the accuracy of the DNA sequence data and/or of the optical mapping data, an alignment of the DNA sequence data based on the results of the validation procedure, and a reordering thereof.

FIG. 2 illustrates another embodiment of the system 10 according to the present invention in which the optical mapping data 140 is transmitted to the system 10 directly from an external source, without the use of the communications network 100 for such transfer of the data. In this second embodiment of the system as shown in FIG. 2, the DNA sequence data 110, 120, 130 is also transmitted directly from the one or more of the DNA sequence databases (e.g., the Sanger Center database, the Celera Genomics database and/or the GenBank database), without the need to use the communications network 100 shown in the first embodiment of FIG. 1. It is also possible for the optical mapping data 140 to be obtained from a storage device provided in or connected to the processing device 10. Such storage device can be a hard drive, a CD-ROM, etc. which are known to those having ordinary skill in the art.

A. Validation Process and System

General Flow Diagram

FIG. 3 is an exemplary embodiment of the process according to the present invention which is preferably executed by the processing device 10 of FIGS. 1 and 2. In this exemplary embodiment, the optical mapping data 140 is forwarded to a technique 250 which constructs one or more consensus maps 260, based on this data 140 by considering the local variations among aligned single molecule maps. One example of such technique 250 is a “gentig” computer program as described in T. Anantharaman et al., “Genomics via Optical Mapping II: Ordered Restriction Maps”, Journal of Computational Biology, 4(2), 1997, pp. 91-118, and T. Anantharaman et al., “Genomics via Optical Mapping III: Contiging Genomic DNA and Variations”, AAAI Press, 7th International Conference on Intelligent Systems for Molecular Biology, ISMB 99, Vol. 7, 1999, pp. 18-27, the entire disclosures of which are incorporated herein by reference. In particular, “gentig” software uses a Bayesian-based (probabilistic) approach to automatically generate “contigs” from optical mapping data. For example, “contigs” can be assembled over whole microbial genomes. The “gentig” software repeatedly combines two islands that produce the greatest increase in probability density, excluding any “contigs” whose false positive overlap probability is unacceptable. For example, four parameters in the program can be altered to change the number of molecules that the program “contigs” together, thus forming the consensus maps. The details of the consensus maps shall be described herein below in further detail.

According to the exemplary embodiment of the present invention, the DNA sequence data (e.g., the GenBank data 110, the Sanger data 120 and the Celera data 130) can be collected at a database collection junction 200, which can be a computer program executed by the processing device 10. This collection can be initiated and/or controlled either manually (e.g., by a user of the processing device 10 to obtain particular DNA sequences) and/or automatically using the processing device 10 or another external device. Upon the collection of the DNA sequence data from one or more of the DNA sequence databases 110, 120, 130, the database collection junction 200 outputs a particular DNA sequence 210 or a portion of such DNA sequence. Thereafter, the data for this DNA sequence 210 (or a portion thereof) is forwarded to a technique 220 which simulates a restriction enzyme digestion process to generate an “in silico” ordered restriction sequence map 230.

Thereafter, the system and process of the present invention executes a validation algorithm 270 which determines the accuracy of the ordered restriction sequence map 230 based on the data provided in the optical consensus map(s) 260. This result can be output as one or more results 280 in the form of a score (e.g., a rank for each ordered restriction map), a binary output (e.g., the accuracy validated vs. unvalidated), etc.

Provided herein below is detailed information regarding the consensus maps and the sequence maps.

Consensus (Optical) Map

The consensus optical map can be defined as a genome-wide, ordered restriction map which is represented as a structured item consisting of particular identification data and a variable length vector composed of fragments. For example, the consensus map can be represented by a vector of fragments, where each fragment is a triple of positive real numbers

$\langle c_{i}, l_{i}, \sigma_{i} \rangle \in \mathbb{R}^{3}$

and where c_(i) is defined as the cut probability associated with a Bernoulli Trial, and l_(i) is the fragment size, related to the mean of a random variable with Gaussian distribution having an estimated standard deviation equal to σ_(i). For example, the total length of the fragment vector can be defined as N. Also, it is possible to index the vector of fragments from 0 to N−1.
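The structured item described above could be represented, for instance, as shown below. This is only a sketch: the class and field names (`ConsensusFragment`, `ConsensusMap`, `cut_prob`, `size`, `sigma`, `map_id`) are hypothetical and do not come from the “gentig” program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsensusFragment:
    cut_prob: float  # c_i, cut probability (Bernoulli trial)
    size: float      # l_i, fragment size (mean of the Gaussian)
    sigma: float     # sigma_i, estimated standard deviation of the size

    def __post_init__(self):
        # Each fragment is a triple of positive real numbers.
        if not (self.cut_prob > 0 and self.size > 0 and self.sigma > 0):
            raise ValueError("c_i, l_i and sigma_i must all be positive")


@dataclass
class ConsensusMap:
    map_id: str                         # identification data
    fragments: list[ConsensusFragment]  # indexed from 0 to N-1


example = ConsensusMap("example", [ConsensusFragment(0.9, 12.5, 1.2),
                                   ConsensusFragment(0.8, 30.0, 2.0)])
print(len(example.fragments))  # N, the length of the fragment vector
```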

The consensus maps can be created from several long genomic single molecule maps, where each molecule map thereof may be obtained from the images of the molecules stretched on a surface and further combined by a Bayesian algorithm implemented in the “gentig” program. As described above, the “gentig” program is capable of constructing consensus maps by considering local variations among the aligned single molecule maps.

Sequence Map

As is generally known, a sequence is a string of letters obtained from a set {A, C, G, T, N, X}. These letters have a standard meaning in the art of bioinformatics. In particular, the letters A, C, G, T are DNA bases, N is “unknown”, and X is a “gap”.

A sequence map is an “in silico” ordered restriction map obtained from the sequence by simulating a restriction enzyme digestion process. Hence, each sequence map has some piece of identification data plus the vector of fragments, whose elements encode exactly the size in base-pairs. The sequence map fragment vector j-th element is defined as a number a_(j) which is the size of the fragment. The total length of the sequence map fragment vector is defined as M. The fragment vector is indexed from 0 to M−1.

Thus, each sequence map has at least a portion of identification data of the DNA sequence data 110, 120, 130, in addition to the vector of fragments whose elements encode exactly the size in base-pairs. The sequence map fragment vector j-th element is indicative of a number a_(j) which corresponds to the size of the fragment. As an example, the total length of the ordered restriction sequence map fragment vector can be M. Thus, the fragment vector can be indexed from 0 to M−1.

Overall Process Description

FIG. 4 shows an exemplary flow chart of the embodiment of the process according to the present invention for simulating a restriction digestion of the sequence map, and then validating the accuracy of the consensus optical ordered restriction map and/or the simulated ordered map. This process can be performed by the processing device 10 which is shown in FIGS. 1 and 2. As shown in this flow chart, the processing device 10 receives the optical ordered restriction data in step 310, which can be the consensus optical map(s) 260 shown in FIG. 3. Then, in step 320, the processing device 10 receives the DNA sequence data, which is preferably the DNA sequence 210 which is also shown in FIG. 3. In step 330, the restriction digestion of the sequence data is simulated to obtain the simulated (in silico) ordered restriction map which is also shown in FIG. 3 as the sequence map(s) 230. Thereafter, in step 340, the accuracy of the optical ordered restriction map and/or of the simulated ordered restriction map is validated, preferably to locate likely matches within one another. Finally, the results of the validation are generated in step 350.

Exemplary Embodiment of Validation Procedure of the Exemplary Process

FIG. 5A shows a detailed flow chart of an embodiment of the exemplary validation procedure utilized in step 340 of the process shown in FIG. 4. In particular, a current fragment of the optical ordered restriction map is compared to a respective fragment of the simulated ordered restriction map to obtain one or more set(s) of most likely matches (step 3410). Then, the processing device 10 determines if all fragments of the simulated ordered restriction map were checked in step 3420. If not, the process takes the next fragment of the simulated ordered restriction map to be the current fragment for checking in step 3430, and the comparison of step 3410 is repeated for the current fragment of the simulated ordered restriction map. Otherwise, because it is determined that all fragments of the simulated ordered restriction map were checked, all of the matches are ranked in step 3440, and the processing device 10 determines the best match(es) in step 3450. If the processing device 10 determines that the rank of the best match(es) is greater than a predetermined threshold (step 3460), the processing device 10 validates the accuracy of the optical ordered restriction map and/or of the simulated ordered restriction map (step 3470). Otherwise such accuracy is not validated in step 3480. It should be understood that the exemplary validation procedure shown in FIG. 5A can be performed for one or multiple iterations over the fragments.

FIG. 5B shows a detailed illustration of an exemplary flow diagram of steps 3410-3430 of FIG. 5A in which the fragments of the optical ordered restriction map are compared to the fragments of the simulated ordered restriction map to obtain one or more set(s) of most likely matches. Particularly, in step 4010, the probability Pr(D|H(σ, p_(c), p_(f))), as shall be described in further detail below, is calculated for each possible alignment of the fragments of the optical ordered restriction map (i.e., the consensus map) against fragments of the simulated ordered restriction map (i.e., the sequence map). Then, in step 4020, an overall match probability as a maximum likelihood estimate (“MLE”) is calculated by extending the computation over all fragments of the consensus map and all fragments of the sequence map.

The exemplary applications of the exemplary embodiment of the process according to the present invention on the sequence and consensus maps are provided in further detail below with reference to FIGS. 6A-6G.

Statistical Description of the Problem

FIG. 6A shows an exemplary setup of the matching procedure involving a sequence map (corresponding to the simulated ordered restriction map) and a consensus map (corresponding to the optical ordered restriction map). The sequence map is preferably considered to be an ideal map, i.e., viewed as the hypothesis H of a Bayesian problem to be analyzed, while the consensus map is preferably considered to be the data D to be validated against hypothesis H. In this manner the following probability density function is formed:

$\Pr(D \mid H(\sigma, p_{c}, p_{f})),$

where σ is a standard deviation which summarizes the map-wide standard deviation data (e.g., σ=f(σ_(i)) for some function ‘f’), p_(c) is the cut probability, and p_(f) is the false positive cut probability. This calculation is shown in FIG. 5B and discussed above.

Ideal Scenario

In an ideal scenario, the orientations of the sequence maps are known, there are no false cuts, and no missing cuts, i.e., p_(c)=1, and p_(f)=0, thus the terms associated with these parameters vanish, as shall be described in further detail below. For example, if a position h in the consensus map is taken, the consensus map fragment sub-vector is provided from the position h to N−1. Also, the full fragment vector of the sequence map can be, e.g., from 0 to M−1. For the sake of simplicity of the explanation of the present invention, it is possible to remove the h position term of the consensus map fragment sub-vector, and count the consensus map fragments from the position term 0 so that expressions such as l_(i), instead of l_(h+i), can be utilized.

To obtain a “match” between the i-th fragments of the consensus map and the corresponding fragments of the sequence map, it is preferable to evaluate to what extent the consensus map and the sequence map deviate from one another. A Gaussian distribution should preferably be utilized for the i-th fragment of each of the maps, and the following expression may be evaluated:

$\frac{1}{\sqrt{2{\pi\sigma}_{i}^{2}}}{\mathbb{e}}^{- \frac{{({l_{i} - a_{j}})}^{2}}{2\sigma_{i}^{2}}}$

Given the above expression, and with the assumption that the sequence map is correct (i.e., Pr(H)=1), the overall Pr(D|H(σ, . . . )) function can be provided as:

$\Pr(D \mid H(\sigma,\ldots)) = \prod_{i=0}^{n} \left( \frac{1}{\sqrt{2\pi}\,\sigma_{i}} e^{-\frac{(l_{i}-a_{j})^{2}}{2\sigma_{i}^{2}}} \right).$

To maximize the likelihood of the validation, it is preferable to utilize the logarithm of the simplified expression and obtain the following expression:

$\ln\left(\Pr(D \mid H(\sigma,\ldots))\right) = \sum_{i=0}^{n} \ln\left(\frac{1}{\sqrt{2\pi\sigma_{i}^{2}}}\right) - \sum_{i=0}^{n} \frac{(l_{i}-a_{j})^{2}}{2\sigma_{i}^{2}}$

Maximizing this expression maximizes the logarithmic likelihood and therefore provides a Maximum Likelihood Estimate (“MLE”).

Since it is possible to assume that the first term of the MLE does not vary extensively from one location to another, it is preferable to simplify the problem by minimizing a “weighted sum-of-error-square” cost function:

$F(D) = \sum_{i=0}^{n} \frac{(l_{i}-a_{i})^{2}}{2\sigma_{i}^{2}}$

Minimizing the function F(D, . . . ) may yield the “best match” of the sequence map (represented as H) against the consensus map (represented as D).
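The “weighted sum-of-error-square” cost can be computed directly from the three fragment vectors. The sketch below is an illustration only; the function name `weighted_sse` and the plain-list inputs are assumptions, and it covers just the ideal scenario in which the i-th consensus fragment is compared with the i-th sequence fragment.

```python
def weighted_sse(l, a, sigma):
    """F(D): sum over i of (l_i - a_i)^2 / (2 * sigma_i^2).

    l     -- consensus fragment sizes l_i
    a     -- sequence fragment sizes a_i
    sigma -- consensus fragment standard deviations sigma_i
    """
    if not (len(l) == len(a) == len(sigma)):
        raise ValueError("the three vectors must have the same length")
    return sum((li - ai) ** 2 / (2 * si ** 2) for li, ai, si in zip(l, a, sigma))


# Two fragments: errors of 1 (sigma 1) and 2 (sigma 2) each contribute 0.5.
print(weighted_sse([10, 20], [11, 18], [1, 2]))
```

A lower value indicates a closer agreement between the two maps at the tested location.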

According to the present invention, it is preferable to take into account the two possible orientations of the sequence map with respect to the consensus map. Below, false cuts and missing cuts in the consensus map are considered.

Orientation

Since the sequence map can be evaluated against the consensus map by “reversing” its orientation, the expression for Pr(D|H(σ, . . . )) can be rewritten as:

$\Pr(D \mid H(\ldots)) = \max\left[\Pr_{1}(D \mid H(\ldots)),\ \Pr_{2}(D \mid H^{R}(\ldots))\right],$

where H^(R) represents the reversed sequence map. As provided previously, it is possible to construct the corresponding function F. Because F is a cost to be minimized, the maximum over the two likelihoods corresponds to the minimum over the two costs:

$F(D,H) = \min\left[F_{1}(D,H),\ F_{2}(D,H^{R})\right].$

Thus, the expression for F₂(D, H^(R)) will be as follows:

${F_{2}\left( {D,H^{R}} \right)} = {\sum\limits_{i = 0}^{n}\left( \frac{\left( {l_{i} - a_{({n - i})}} \right)^{2}}{2\sigma_{i}^{2}} \right)}$
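The orientation test can be sketched as below. The sketch assumes that, since F is a cost, the preferred orientation is the one with the smaller value; the helper names `cost` and `best_orientation` are hypothetical.

```python
def cost(l, a, sigma):
    """Weighted sum-of-error-square between consensus sizes l and sequence sizes a."""
    return sum((li - ai) ** 2 / (2 * si ** 2) for li, ai, si in zip(l, a, sigma))


def best_orientation(l, a, sigma):
    """Evaluate F_1 (sequence map as given) and F_2 (sequence map reversed).

    Returns (cost, orientation) for the cheaper of the two orientations.
    """
    f1 = cost(l, a, sigma)
    f2 = cost(l, list(reversed(a)), sigma)  # a_(n-i) replaces a_i
    return (f1, "forward") if f1 <= f2 else (f2, "reversed")


# The sequence map below is the mirror image of the consensus sizes.
print(best_orientation([10, 20, 30], [30, 20, 10], [1, 1, 1]))
```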

False Cuts and Missing Cuts

In order to correctly model errors in the matching process, it is preferable to take into account false cuts and missing cuts. For example, the matching process can be modeled with two parameters:

-   -   Missing restriction sites in the sequence map are preferably modeled by a probability p_(c) (i.e., a “cut” probability). In particular, p_(c)=1 means that the restriction sites are actually present in the map, 0≤p_(c)<1 means that there are some missing cuts, etc.
-   -   False restriction sites in the consensus map are preferably modeled by a rate parameter p_(f) (i.e., a “false” cut probability). In an exemplary case, 0<p_(f)≤1 means that the consensus map may have some false cuts.

These parameters should preferably be included in the expression describing Pr( . . . ) and, therefore, in the function F( . . . ) described above.

Example 1

No missing cuts and no false cuts. In this example as shown in FIG. 6B, the term for the matching of the i-th fragment of the sequence map 610 against the i-th fragment of the consensus map 620 should preferably take into account the cut probability p_(c). Thus, the expression is as follows:

$p_{c} \times \frac{1}{\sqrt{2\pi\sigma_{i}^{2}}} e^{-\frac{(l_{i}-a_{j})^{2}}{2\sigma_{i}^{2}}},$

which yields the following cost function after taking the negative log likelihood:

${\ln\left( \frac{\sqrt{2\pi\;\sigma_{i}^{2}}}{p_{c}} \right)} + {\frac{\left( {l_{i} - a_{j}} \right)^{2}}{2\sigma_{i}^{2}}.}$

Example 2

Missing cuts and no false cuts. In this example and as shown in FIG. 6C, the exemplary embodiment of the system and method of the present invention considers a cut in the sequence map 630 that has no corresponding cut in the consensus map 610. A match is attempted of the i-th consensus map fragment against the aggregation of the j and j−1 fragments in the sequence map 630. For example, the computation of the Gaussian expression should be “penalized” by taking into account the missing cut. The main term is preferably modeled as:

$p_{c} \times \frac{1}{\sqrt{2\pi\sigma_{i}^{2}}} e^{-\frac{(l_{i}-(a_{j}+a_{(j-1)}))^{2}}{2\sigma_{i}^{2}}} \times (1-p_{c}),$

yielding a cost function:

${\ln\left( \frac{\sqrt{2\pi}\sigma_{i}}{p_{c}} \right)} + \frac{\left( {l_{i} - \left( {a_{j} + a_{({j - 1})}} \right)} \right)^{2}}{2\sigma_{i}^{2}} + {{\ln\left( \frac{1}{1 - p_{c}} \right)}.}$

Example 3

No missing cuts and some false cuts. In this case and as shown in FIG. 6D, the converse case of Example 2 is being considered. A false cut event of the consensus map 640 can be modeled as a Bernoulli trial with probability p_(f). For example, the full term for such matching would likely aggregate fragments i and i−1 of the consensus map 640 against the j-th fragment of the sequence map 620. The full term would likely be:

$p_{c} \times \frac{1}{\sqrt{2\pi(\sigma_{i}^{2}+\sigma_{(i-1)}^{2})}} e^{-\frac{((l_{i}+l_{(i-1)})-a_{j})^{2}}{2(\sigma_{i}^{2}+\sigma_{(i-1)}^{2})}} \times p_{f}.$

Taking the negative log likelihood again, the following expression is obtained:

${\ln\left( \frac{\sqrt{2{\pi\left( {\sigma_{i}^{2} + \sigma_{({i - 1})}^{2}} \right)}}}{p_{c}} \right)} + \frac{\left( {\left( {l_{i} + l_{({i - 1})}} \right) - a_{j}} \right)^{2}}{2\left( {\sigma_{i}^{2} + \sigma_{({i - 1})}^{2}} \right)} + {{\ln\left( \frac{1}{p_{f}} \right)}.}$

It should be noted that for the current data obtained from the optical mapping process, p≃10⁻⁵. These current data often dominate the complete expression.
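To give a sense of scale, the Example 3 cost can be evaluated numerically. The sketch below is illustrative only: the function name `false_cut_cost` is hypothetical, and reading the value 10⁻⁵ quoted above as the false cut probability p_(f) is an assumption made for the example.

```python
import math


def false_cut_cost(l_i, l_prev, s_i, s_prev, a_j, p_c, p_f):
    """Negative log likelihood of matching consensus fragments i and i-1
    (merged across one false cut) against sequence fragment j."""
    var = s_i ** 2 + s_prev ** 2          # variances of merged fragments add
    size_error = (l_i + l_prev) - a_j
    return (math.log(math.sqrt(2 * math.pi * var) / p_c)
            + size_error ** 2 / (2 * var)
            + math.log(1 / p_f))


# With a perfect size match, the false-cut term ln(1/p_f) is about 11.5,
# far larger than the other two terms for standard deviations near 1.
print(false_cut_cost(10.0, 5.0, 1.0, 1.0, 15.0, p_c=0.9, p_f=1e-5))
print(math.log(1 / 1e-5))
```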

Example 4

Some missing cuts and some false cuts. Of course, it is conceivable that there may be missing cuts and false cuts together as shown in FIG. 6E. It is possible to accurately match or align the i−u cut in the sequence map 660 against the j−v cut in the consensus map 650. It is also possible to properly match the (i+1)-th cut (the cut immediately following the i-th fragment in both the consensus map 650 and the sequence map 660) in the two maps by appropriately treating all the intervening missing cuts in the sequence map 660 and all the intervening false cuts in the consensus map 650. In this case, the “matching term” has the following general form:

$p_{c} \times \frac{1}{\sqrt{2\pi(\sigma_{i}^{2}+\sigma_{(i-1)}^{2}+\ldots+\sigma_{(i-v)}^{2})}} \times e^{-\frac{((l_{i}+l_{(i-1)}+\ldots+l_{(i-v)})-(a_{j}+a_{(j-1)}+\ldots+a_{(j-u)}))^{2}}{2(\sigma_{i}^{2}+\sigma_{(i-1)}^{2}+\ldots+\sigma_{(i-v)}^{2})}} \times (1-p_{c})^{(u-1)} \times p_{f}^{(v-1)}.$

Taking the negative log likelihood, the following expression is obtained:

$-\ln p_{c} + \ln\left(\sqrt{2\pi(\sigma_{i}^{2}+\sigma_{(i-1)}^{2}+\ldots+\sigma_{(i-v)}^{2})}\right) + \frac{((l_{i}+l_{(i-1)}+\ldots+l_{(i-v)})-(a_{j}+a_{(j-1)}+\ldots+a_{(j-u)}))^{2}}{2(\sigma_{i}^{2}+\sigma_{(i-1)}^{2}+\ldots+\sigma_{(i-v)}^{2})} + (u-1)\ln\frac{1}{1-p_{c}} + (v-1)\ln\frac{1}{p_{f}}.$

B. Dynamic Programming Procedure

The validation of a sequence map against the optical map can be implemented as a dynamic programming procedure (“DPP”). Detailed descriptions of the DPP are provided in T. H. Cormen et al., “Introduction to Algorithms”, The MIT Press and McGraw-Hill, 1990, and D. Gusfield, “Algorithms on Strings, Trees, and Sequences”, Cambridge University Press, 1997, the entire disclosures of which are incorporated herein by reference. An exemplary DPP for the process according to the present invention is as follows:

-   -   Procedure sequence-map-validate (sequence-map, consensus-map)
        /* Other parameters will be specified . . . e.g., p_(f), p_(c), k, etc. */
        begin
        -   run DPP on consensus-map and sequence-map;
        -   run DPP on consensus-map and reversed sequence-map;
        -   collect the k “best” alignments by examining the last row of both DPP tables and “return” them;
    -   end
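The procedure above can be sketched in Python as follows. This is a hedged outline rather than the actual implementation: `run_dpp` is a hypothetical callable standing in for the dynamic programming step, assumed here to return the last row of its table as a list of costs (lower is better), and `k` is the number of alignments to keep.

```python
import heapq


def sequence_map_validate(sequence_map, consensus_map, run_dpp, k=5):
    """Run the DPP in both orientations and return the k best alignments.

    Each result is a tuple (cost, end_position, orientation).
    """
    forward_row = run_dpp(consensus_map, sequence_map)
    reversed_row = run_dpp(consensus_map, sequence_map[::-1])
    candidates = [(c, pos, "forward") for pos, c in enumerate(forward_row)]
    candidates += [(c, pos, "reversed") for pos, c in enumerate(reversed_row)]
    # The k smallest costs across both last rows are the k "best" alignments.
    return heapq.nsmallest(k, candidates)


# Toy stand-in for the DPP: cost is the size error against the first fragment.
def toy_dpp(consensus, sequence):
    return [abs(c - sequence[0]) for c in consensus]


print(sequence_map_validate([5, 9], [4, 9, 7], toy_dpp, k=2))
```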

This DPP procedure can be executed two or more times. It is improbable for two alignments for the sequence map and for its reversed version to have equivalent scores. It is preferable to start from the DPP's main recurrence to obtain a formulation of the sequence map vs. consensus map matching expression.

Dynamic Programming “Main” Recurrence

For the description provided below, the index i shall be used to indicate a fragment in the consensus map, and the index j to indicate a fragment in the sequence map. Assuming, as defined above, that the consensus map has N fragments and that the sequence map has M fragments, the DPP may preferably utilize an N×M matching table T. Considering the entry T[i, j], this entry will likely contain the partially computed value of the matching function F( . . . ). For example, F( . . . ) would be incrementally computed from “left” to “right” by taking into consideration all possible fragment by fragment matches.

The main recurrence for entry T[i, j] is provided as follows:

${T\left\lbrack {i,j} \right\rbrack}:={\begin{matrix}\min \\{0 < u \leq i} \\{0 < v \leq j}\end{matrix}{\begin{pmatrix}{{T\left\lbrack {{i - u},{j - v}} \right\rbrack} + {\ln\left( \frac{\sqrt{2{\pi\left( {\sigma_{i}^{2} + \sigma_{({i - 1})}^{2} + \ldots + \sigma_{({i - v})}^{2}} \right)}}}{p_{c}} \right)} +} \\{\frac{\left( {\left( {l_{i} + l_{({i - 1})} + \ldots + l_{({i - v})}} \right) - \left( {a_{j} + a_{({j - 1})} + \ldots + a_{({j - u})}} \right)} \right)^{2}}{2\left( {\sigma_{i}^{2} + \sigma_{({i - 1})}^{2} + \ldots + \sigma_{({i - v})}^{2}} \right)} +} \\{{\left( {u - 1} \right)\ln\frac{1}{1 - p_{c}}} + {\left( {v - 1} \right)\ln\frac{1}{p_{f}}}}\end{pmatrix}.}}$

The determination of the respective sizes of u and v should be performed. In one exemplary embodiment of the present invention, the sizes of u and v should preferably depend on the σ_(i)'s. In another exemplary embodiment of the present invention, u and v may depend also on the digestion rate of the “in vivo” experiment that breaks up the DNA molecule. However, a pragmatic bound may be equal to, e.g., three times the overall standard deviation (which in practice can be approximated by the value 3). This bound may preferably become a parameter of the DPP. In this way, the computation for each entry T[•,•] should consider approximately nine neighboring or adjacent entries.
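A compact sketch of a bounded recurrence of this kind is given below. It is not the recurrence of the invention verbatim, and several conventions are assumptions made so that the example is self-consistent: `T[i][j]` covers the first i consensus fragments and the first j sequence fragments, the last block merges exactly v consensus fragments (v−1 false cuts) and u sequence fragments (u−1 missing cuts), both u and v are capped by `bound`, and a simple global alignment with `T[0][0] = 0` is used in place of the boundary models discussed elsewhere. The name `dp_table` is hypothetical.

```python
import math


def dp_table(l, sigma, a, p_c=0.9, p_f=1e-5, bound=3):
    """Fill the matching table for consensus sizes l (std devs sigma)
    against sequence sizes a; requires 0 < p_c < 1 and 0 < p_f <= 1."""
    n, m = len(l), len(a)
    T = [[math.inf] * (m + 1) for _ in range(n + 1)]
    T[0][0] = 0.0
    missing_pen = math.log(1 / (1 - p_c))  # cost of one missing cut
    false_pen = math.log(1 / p_f)          # cost of one false cut
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for v in range(1, min(i, bound) + 1):
                size = sum(l[i - v:i])
                var = sum(s * s for s in sigma[i - v:i])
                for u in range(1, min(j, bound) + 1):
                    prev = T[i - v][j - u]
                    if prev == math.inf:
                        continue
                    err = size - sum(a[j - u:j])
                    cand = (prev
                            + math.log(math.sqrt(2 * math.pi * var) / p_c)
                            + err * err / (2 * var)
                            + (u - 1) * missing_pen
                            + (v - 1) * false_pen)
                    T[i][j] = min(T[i][j], cand)
    return T


table = dp_table([10.0, 20.0], [1.0, 1.0], [10.0, 20.0])
print(table[2][2])
```

With `bound=3`, each entry inspects at most nine predecessor entries, which is the “constant” per-entry work mentioned in the text.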

A simple model for the initial conditions should preferably be as follows:

T[i,0]:=∞, for i∈[1,N],
T[0,j]:=0, for j∈[1,M].

In this model, it is preferable to never match or strongly penalize a match of the first fragments of the consensus map against an “inner” fragment of the sequence map (cf. first column having a ∞ value). Also, the match of any fragment of the consensus map can be made against the first fragment of the sequence map rather neutral (with the first two zero values). A more complex model initializes the first row of the dynamic programming table by taking into account, e.g., only the size of the i-th fragment. Provided below is an exemplary description of a complete model for the above-referenced boundary conditions.

Left and Right End Fragment Computations.

It is possible to provide a more sophisticated and accurate model for the left fragments and right fragments calculations (i.e., for the initial and final conditions). Such models take into consideration the case in which certain fragments on either the left or the right of the sequence map do not “properly match” any fragment in the consensus map.

I. Left End Penalty Computation

As shown in FIG. 6F, the first “matching fragments” are a₂ from the sequence map 680, and l_(j) from the consensus map 670, identified by their size. The general case is for fragment i of the sequence map 680 to match fragment j of the consensus map 670.

An analysis of the fragment a₀ of the sequence map 680 is as follows. Most of the time, the left end of this fragment a₀ (which can be assumed not to correspond to an actual restriction site) will fall within the boundaries of fragment i−n of the consensus map 670 (for 0≤n≤i).

Within this framework, the minimum value that can be assigned to a “match” of the left end fragments of the sequence map 680 corresponds to one of three cases:

-   -   Match by extension of the first left end fragment of the sequence map 680.
-   -   Bad matches until fragment i of the sequence map matches fragment j of the consensus map 670.
-   -   Match without extension to some fragment in the consensus map 670.

Example 1

Extending a₀ by x leads to a match. If a₀ is “extended” by an extra size x (as shown in FIG. 6F), x is extended as far to the left as possible to match the cut on the left of fragment i−n (e.g., the fragment of size l_(i−2) illustrated in FIG. 6F).

The value of this match (which is built on top of the derivation performed for the “regular case”) is provided by the following expression:

${\ln\left( \frac{\sqrt{2{\pi\left( {\sigma_{({i - n})}^{2} + \sigma_{({i - {({n - 1})}})}^{2} + \ldots + \sigma_{i}^{2}} \right)}}}{p_{c}} \right)} + \frac{\left( {\left( {l_{({i - n})} + l_{({i - {({n - 1})}})} + \ldots + l_{i}} \right) - \left( {x + a_{0} + a_{1} + \ldots + a_{j}} \right)} \right)^{2}}{2\left( {\sigma_{({i - n})}^{2} + \sigma_{({i - {({n - 1})}})}^{2} + \ldots + \sigma_{i}^{2}} \right)} + \frac{x}{L} + {\left( {n - 1} \right)\mspace{11mu}\ln\frac{1}{p_{f}}} + {j\mspace{11mu}\ln{\frac{1}{1 - p_{c}}.}}$

This expression depends on two parameters which did not appear in the regular case:

-   -   x being the size extension (it appears in the second and the third terms), and
-   -   L being the molecule map average fragment size.

The second sub-term is preferably the regular “sizing error” penalty which takes into account the extension x. The third sub-term may add an extra penalty based on the amount of the end fragment being stretched with respect to the overall structure of the expression. To utilize the expression, it is beneficial to find its minimum with respect to x. By differentiating with respect to x and setting the derivative to zero, the expression can be minimized by setting x as follows:

$x = \left((l_{(i-n)}+l_{(i-(n-1))}+\ldots+l_{i}) - (a_{0}+a_{1}+\ldots+a_{j})\right) - \frac{\sigma_{(i-n)}^{2}+\sigma_{(i-(n-1))}^{2}+\ldots+\sigma_{i}^{2}}{L}$

By substituting this value for x in the original expression, the following expression is obtained:

$\ln\left(\frac{\sqrt{2\pi(\sigma_{(i-n)}^{2}+\sigma_{(i-(n-1))}^{2}+\ldots+\sigma_{i}^{2})}}{p_{c}}\right) + \frac{(l_{(i-n)}+l_{(i-(n-1))}+\ldots+l_{i}) - (a_{0}+a_{1}+\ldots+a_{j})}{L} + \left(-\frac{1}{2L^{2}}\right)\left(\sigma_{(i-n)}^{2}+\sigma_{(i-(n-1))}^{2}+\ldots+\sigma_{i}^{2}\right) + n\ln\frac{1}{p_{f}} + j\ln\frac{1}{1-p_{c}}.$

Again, the last two sub-terms may account for the false cuts and the missing cuts, respectively. It is possible to assume that there is at least one “good” cut in the sequence map.
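The minimization over x can be spelled out in a few lines. The shorthand S, A and V below is introduced only for this derivation and is not part of the original notation; only the two sub-terms that depend on x are considered.

```latex
% Shorthand (assumed for this sketch only):
%   S = l_{(i-n)} + \ldots + l_{i}          (merged consensus size)
%   A = a_{0} + a_{1} + \ldots + a_{j}      (merged sequence size)
%   V = \sigma_{(i-n)}^{2} + \ldots + \sigma_{i}^{2}   (summed variance)
\begin{align*}
g(x)     &= \frac{(S - A - x)^{2}}{2V} + \frac{x}{L}, \\
g'(x)    &= -\frac{S - A - x}{V} + \frac{1}{L} = 0
            \;\Longrightarrow\; x^{*} = (S - A) - \frac{V}{L}, \\
g(x^{*}) &= \frac{V}{2L^{2}} + \frac{S - A}{L} - \frac{V}{L^{2}}
          = \frac{S - A}{L} - \frac{V}{2L^{2}}.
\end{align*}
```

The last line reproduces the second and third sub-terms of the substituted expression, and the second derivative 1/V is positive, so x* is a minimum.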

Example 2

No extension and bad matches until i and j. In this case, the first “good match” is located when fragment i of the sequence map matches fragment j of the consensus map. The expression corresponding to this case is

$n\ln\left(\frac{1}{p_{f}}\right) + (j+1)\ln\left(\frac{1}{1-p_{c}}\right)$

This expression takes into consideration (and possibly corrects) all missing matches and the false matches in both maps (e.g., the j+1 term takes into account the 0-th cut as a missing cut).

Example 3

Match without extension to some fragment in the consensus map. It shall be assumed that a “good match” exists between fragment i of the consensus map and fragment j of the sequence map, and, as with Example 1 of this subsection, the fragment from the consensus map within which the end of fragment 0 (of size a₀) of the sequence map lies is indexed i−n.

A match of the fragment 0 of the sequence map to any of the n fragments up to fragment i of the consensus map is then attempted. All possible missing cuts and false cuts along the way are taken into consideration. The attempt of minimizing the following expression (dependent on k) will likely compete against the expressions in Examples 1 and 2 for the best end match.

$\min_{0 \leq k \leq i}{\begin{pmatrix}{\frac{\left( {\left( {l_{({i - k})} + l_{({i - {({k - 1})}})} + \ldots + l_{i}} \right) - \left( {x + a_{0} + a_{1} + \ldots + a_{j}} \right)} \right)^{2}}{2\left( {\sigma_{({i - k})}^{2} + \sigma_{({i - {({k - 1})}})}^{2} + \ldots + \sigma_{i}^{2}} \right)} +} \\{{\left( {k - 1} \right)\mspace{11mu}\ln\frac{1}{p_{f}}} + {j\mspace{11mu}\ln\frac{1}{1 - p_{c}}}}\end{pmatrix}.}$

II. Right End Penalty Computation

FIG. 6G shows an exemplary illustration of the maps which are utilized for the right end penalty computation, i.e., for fragments trailing the end of the sequence map 690 and/or of the consensus map 680. This computation is almost symmetric to the left end penalty computation described above.

However, there is a difference to be taken into account for the right end computation which makes the computation asymmetrical with respect to the left end penalty computation described above. When the “last good match” between fragment i of the consensus map 670 and fragment j of the sequence map 690 is considered, the score of the match up to that point should also be considered. In particular, the value T[j, i] should be considered (and is thus assumed to be available at that point).

Thus, as per the left end computation, three terms should be considered. They are analogous to the three terms for the left end computation, but they should be augmented with T[j, i] to be meaningful.

III. Description of the Exemplary Validation Procedure

FIG. 7 shows a detailed illustration of the exemplary flow diagram and architecture of the validation procedure according to the present invention which utilizes dynamic programming principles and the sequence and consensus maps illustrated in FIGS. 6F and 6G. Each box represents the solution of a “dynamic programming”-like problem. In particular, the map data is provided to a left end table 360 which then passes at least a portion of such data to a middle table 365. The outputs of both the left end table 360 and the middle table 365 are combined in block 370, and the combined results are forwarded to a results table I 375. Then, at least a portion of the data from the results table I 375 is passed to a right end table 380, and the combined results are forwarded to a results table II 385. The data in the results table I 375 and the results table II 385 are computed using the scores contained in the other tables (e.g., the left end table 360, the middle table 365 and the right end table 380). The overall computation uses these three tables 360, 365, 380 as follows:

-   -   the T[.,.] for the middle table computation;
-   -   the TL[.,.] for the left end penalty computation; and
-   -   the TR[.,.] for the right end penalty computation.

It is also possible to re-use certain tables to save memory and system resources of the processing device 10. The flow of control produces the content of each table 360, 365, 380, in turn, and the final resulting table (e.g., the results table II 385) can be examined to reconstruct the alignment trace-back.

IV. Possible Optimization

Filling the entire T[.,.] table, i.e., the middle table 365, may take on the order of 4 times O(N²M min(N,M)) to complete, where N is the size of the consensus map and M is the size of the sequence map, as defined above. However, it is possible to optimize the filling of the middle table 365 down to O(NM min(N,M)) by utilizing the limiting argument on the computation performed for each entry T[i, j]. Because of the limit on u and v, the computation time for each entry can be considered “constant”.

In a simple setup, the middle table 365 may take up O(NM) space, hence it too may be quadratic even when extra “backtrace recording” is considered, as described in Gusfield, D., “Algorithms on Strings, Trees, and Sequences”, Cambridge University Press, 1997.

It is also possible to optimize the execution time via a hashing scheme similar to the scheme used in the “gentig” program. In such case, the time complexity can be reduced by a further order of magnitude.

Experimental Results

The first experiments using software based on the system and method described above checked “in silico” maps obtained from Plasmodium falciparum sequence data against optical ordered restriction maps for the same organism.

I. Plasmodium falciparum Sequence Data

The sequence for Plasmodium falciparum's 14 chromosomes was obtained from the Sanger Institute database (www.sanger.ac.uk) and from the TIGR database (www.tigr.org). The experiment cut the sequences “in silico” using the BamHI restriction enzyme. The resulting maps were fed to the software (implementing the process according to the present invention) along with appropriate optical ordered restriction maps.

The results of the experiments on chromosome 2 and chromosome 3 (showing the number of fragments) are provided below, as well as the experiment on all chromosomes using a particular enzyme (e.g., NheI).

              Number of Fragments
Chromosome    from DB    reversed
chr 2         30         23
chr 3         36         28

Two “in silico” maps were provided for the chromosome 2 and chromosome 3 sequences with the fragment numbers obtained being provided in the table above. The molecule maps thus produced were then sent to the validation checker alongside various consensus maps.

II. Plasmodium falciparum Optical Ordered Restriction

An optical ordered restriction map published in J. Jing et al., “Optical Mapping of Plasmodium Falciparum Chromosome 2”, Genome Research, 9:175-181, 1999 and Z. Lai et al., “A shotgun optical map of the entire Plasmodium Falciparum genome”, Nature Genetics, 23:309-313, 1999, and the maps generated by the “gentig” program were utilized for this experiment. The “gentig” program provided an indication of the overall standard deviation to be used for each fragment of the consensus map. The parameter used was:

$\hat{\sigma} = 4.4754$ Kbps,

and each fragment was assigned a standard deviation of:

$\hat{\sigma}\sqrt{\frac{l}{L}}$ Kbps,

where l is the fragment size and L is the average consensus map fragment size.
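The per-fragment standard deviation can be computed as sketched below. The function name `fragment_sigmas` is hypothetical, and computing L as the plain mean of the sizes passed in is an assumption about how the average consensus map fragment size is obtained.

```python
import math

SIGMA_HAT_KB = 4.4754  # overall standard deviation reported above, in Kbps


def fragment_sigmas(sizes_kb, sigma_hat=SIGMA_HAT_KB):
    """Return sigma_hat * sqrt(l / L) for each fragment size l (in Kbps),
    where L is the mean of the supplied fragment sizes."""
    mean_size = sum(sizes_kb) / len(sizes_kb)
    return [sigma_hat * math.sqrt(size / mean_size) for size in sizes_kb]


# A fragment of exactly average size receives sigma_hat itself; smaller
# fragments receive a smaller deviation and larger fragments a larger one.
print(fragment_sigmas([10.0, 20.0, 30.0]))
```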

III. Validation Procedure Results

The validation DPP according to the present invention was executed on chromosome 2 and chromosome 3. The DPP ran with the following limitations:

-   -   The u and v parameters for the main recurrence formula were set to 3.
-   -   The procedure for matching the left and right ends of the sequence maps using the special computations described above was not utilized.

The summary of the results are provided below in Tables 1-3. Table 1 and3 show the match of the sequence maps for chromosomes 2 and 3 againstthe consensus maps generated by the “gentig”. Table 2 shows the match ofthe sequence maps against the consensus map which as published in M. J.Gardner et al., “Chromosome 2 sequence of the human malaria parasitePlasmodium Falciparum”, Science, 282:1126-1132, 1998. The position ofthe matches of the sequence against the consensus maps are also shown inTables 1-3.

TABLE 1 Chromosome 2 Validation Summary A

rank  matches  score    map id  # missing cuts  # false cuts
1     29        80.869  1302     0              1
2     28       105.861  1302     2              1
3     18       126.956  1326    12              4
4     22       127.488  1305     8              4
5     18       132.890  1414    12              2

In particular, Table 1 shows the data for the best “matches” found by the validation procedure of the present invention for the case of Plasmodium falciparum chromosome 2. The “in silico” sequence map was obtained from the TIGR database sequence. The sequence map (as well as its reverse) was checked against 75 (optical) consensus maps produced by the gentig program. The 75 optical maps cover the entire Plasmodium falciparum genome. The validation procedure located its best matches against the map tagged 1302.

TABLE 2 Chromosome 2 Validation Summary B

rank  matches  score    map id    # missing cuts  # false cuts
1     29        77.308  NYU-WISC  1               0
2     22       125.088  NYU-WISC  8               2
3     22       130.866  NYU-WISC  8               4
4     24       131.475  NYU-WISC  6               1
5     24       132.838  NYU-WISC  6               4

Table 2 shows the data for the best “matches” found by the validation procedure of the present invention for the case of Plasmodium falciparum chromosome 2. The “in silico” sequence map was obtained from the TIGR database sequence. The sequence map (as well as its reverse) was checked against the map published in the M. J. Gardner et al. publication.

TABLE 3 Chromosome 3 Validation Summary

rank  matches  score    map id  # missing cuts  # false cuts
1     35       108.360  1365    1               0
2     32       117.571  1365    4               1
3     32       119.956  1365    4               2
4     35       121.786  1296    1               3
5     31       125.265  1365    5               1

Table 3 shows the data for the “best” matches found by the validation procedure of the present invention for the case of Plasmodium falciparum chromosome 3. The “in silico” sequence map was obtained from the Sanger Institute database sequence. The sequence map (as well as its reverse) was checked against 75 (optical) consensus maps produced by gentig. The 75 optical maps cover the entire Plasmodium falciparum genome. The validation procedure located its best matches against the map tagged 1365.

The processing device 10 of the present invention executed approximately 75×4=300 DPP instances in about 5 minutes during the experiment. Also, during this experiment, the processing device 10 kept track of all the intermediate results and made them available for interactive inspection after the actual execution. Also, the sequence, the sequence map, and the consensus maps were always available for inspection and manipulation.

IV. Conclusion

The statistical model of an exemplary embodiment of the present invention is essentially a formulation of a maximum likelihood problem which is solved by minimizing a weighted sum-of-square-error score. The solution is computed by constructing a “matching table” using a dynamic programming approach whose overall complexity is of the order O(MN min(N, M)) (for our non-optimized solution), where N is the length of the sequence map and M is the length of the consensus map. The preliminary results of the experiment described above illustrate how the process and system of the present invention can be used in assessing the accuracy of various sequence and map data currently being published in a variety of formats from many different sources.
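The construction of a “matching table” of this kind can be sketched as follows. This is a simplified illustration and not the implementation used in the experiment: it assumes that u consecutive consensus fragments are matched against v consecutive sequence fragments, that each extra consensus cut is charged ln(1/p_f) and each extra sequence cut ln(1/(1−p_c)), that a match may start anywhere along the consensus map, and that u and v are bounded (3 by default, as in the experiment). The function name, the parameter defaults and the toy data are hypothetical.

```python
import math


def matching_table(consensus, sigmas, sequence, p_c=0.9, p_f=0.1, max_u=3, max_v=3):
    """Fill T[i][j]: best score of matching the first j sequence fragments
    so that the match ends at consensus fragment i (lower is better)."""
    M, N = len(consensus), len(sequence)
    T = [[math.inf] * (N + 1) for _ in range(M + 1)]
    for i in range(M + 1):
        T[i][0] = 0.0  # assumption: the match may start at any consensus cut
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            for u in range(1, min(i, max_u) + 1):
                l_sum = sum(consensus[i - u:i])
                var = sum(s * s for s in sigmas[i - u:i])
                for v in range(1, min(j, max_v) + 1):
                    prev = T[i - u][j - v]
                    if prev == math.inf:
                        continue
                    a_sum = sum(sequence[j - v:j])
                    score = (prev
                             + math.log(math.sqrt(2 * math.pi * var) / p_c)
                             + (l_sum - a_sum) ** 2 / (2 * var)
                             + (u - 1) * math.log(1 / p_f)
                             + (v - 1) * math.log(1 / (1 - p_c)))
                    T[i][j] = min(T[i][j], score)
    return T


if __name__ == "__main__":
    consensus = [10.0, 20.0, 30.0, 40.0, 50.0]
    T = matching_table(consensus, [1.0] * 5, [20.0, 30.0, 40.0])
    end = min(range(1, len(consensus) + 1), key=lambda i: T[i][3])
    print("best match ends at consensus fragment", end, "with score", T[end][3])
```

With bounded u and v the table is filled in time proportional to M×N; the unbounded recurrence corresponds to the higher complexity quoted above.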

B. Alignment and Reordering Process and System

Overall Alignment Process Flow Diagram

FIG. 8 shows an exemplary embodiment of the process for aligning sequences using optical maps according to the present invention which can also be executed by the processing device 10 of FIGS. 1 and 2. In this exemplary embodiment and similarly to the validation process illustrated in FIG. 3, the optical mapping data 140 is forwarded to a technique 250 (e.g., the “gentig” program) which constructs one or more consensus maps 260 based on the optical mapping data 140 by considering the local variations among aligned single molecule maps.

According to this exemplary embodiment of the alignment process of the present invention, the particular DNA sequence 210 or a portion of such DNA sequence is provided. Thereafter, the data for this DNA sequence (or a portion thereof) is forwarded to a technique 220 which simulates a restriction enzyme digestion process to generate an “in silico” ordered restriction sequence map 230. The system and process of the present invention may then execute the validation algorithm 270 which determines the accuracy of the ordered restriction sequence map 230 based on the data provided in the optical consensus map(s) 260. As with the validation procedure of FIG. 3, this result can be output 280 in the form of a response such as a score (e.g., a rank for each ordered restriction map), a binary output (e.g., the accuracy validated vs. unvalidated), etc. The exemplary embodiments of the validation process and system of the present invention have been described in great detail herein above. Finally, the simulated ordered restriction sequence map(s) can be aligned against the optical ordered restriction map in block 400. In one exemplary embodiment of the alignment process of the present invention, for each simulated ordered restriction map, the best anchoring position of such map is located on the ordered restriction consensus map (e.g., an optical consensus map). The result of such location procedure is the generation of the entire set of anchoring positions of the simulated ordered restriction maps. In one preferred embodiment, the best anchoring positions are provided first to effectuate the best possible alignment. This can be done using a one-dimensional Dynamic Programming Procedure. Those having ordinary skill in the art would clearly understand that it is possible to produce multiple alignments for the simulated ordered restriction maps due to the many anchoring positions that may be available. Provided below are further details of the alignment process and system according to the present invention.

Detailed Flow Diagram of Alignment Process

FIG. 9 shows an exemplary flow chart of the embodiment of the process according to the present invention for simulating a restriction digestion of the sequence map, validating the accuracy of the consensus optical ordered restriction map and/or the simulated map, and constructing an alignment therefor. This process can be performed by the processing device 10 which is shown in FIGS. 1 and 2. Similarly to the validation process shown in FIG. 4, the processing device 10 receives the optical ordered restriction data in step 410, which can be the consensus optical map(s) 260 shown in FIG. 8. Then, in step 420, the processing device 10 receives the sequence data, which is preferably the DNA sequence data 210 also shown in FIG. 8. In step 430, the restriction digestion of the sequence data is simulated to obtain the simulated (in silico) ordered restriction map which is also shown in FIG. 8 as the sequence map(s) 230. Thereafter, the optical ordered restriction map is compared to the simulated ordered restriction map to obtain one or more sets of most likely matches (step 440). The processing device 10 then determines if all the simulated ordered restriction maps were checked in step 445. If not, the process takes the next simulated ordered restriction map to be the current simulated ordered restriction map to be checked in step 450, and the comparison of step 440 is repeated for the current simulated ordered restriction map. Otherwise, since it is determined that all the simulated ordered restriction maps were checked, all of the matches are ranked in step 460, and the processing device 10 determines the best match(es) for each simulated ordered restriction map based on the respective ranks in step 470. Then, in step 480, the alignment of the simulated ordered restriction map is constructed with respect to the optical ordered restriction maps based on the score of the matches.

Global Alignment

To reiterate, the validation process and system of the present invention described above can match an ordered restriction sequence map against an ordered restriction consensus map. This validation process and system can be described as a positioning process of the sequence map against the consensus map. When the positionings of many sequences are taken into consideration, it may be possible to describe the validation process as a “global” collective alignment against a particular consensus map. Thus, for the sake of clarity, the output of the procedure that produces this final result shall be referred to herein below as an alignment.

For example, the result of n “validation experiments” can be identified as n sets of possible sequence positions along the consensus map. Each of these results can be denoted as set S_(i) (with 0<i≦n), with |S_(i)|=k. Each of the k items in each S_(i) is a triple [s_(i), x_((i,j)), v_((i,j))], where s_(i) is a sequence map identifier, x_((i,j)) is the j-th alignment of s_(i) against the consensus map, and v_((i,j)) is the sequence alignment score (with 0<j≦k) obtained from the single sequence (map) positioning process. The set containing every S_(i) (with 0<i≦n) is called S.

An exemplary embodiment of the procedure to perform the matching, ranking and alignment steps 440-480 using the sequence maps and costs described above is provided below with reference to FIG. 10. The end result will preferably be an alignment whose overall cost C can be computed by summing all the costs v_((i,j)) eventually selected.

Initially, in step 510, the global cost C is set to infinity. Then, in step 520, the best matches out of each set S_(i) of simulated ordered restriction maps (i.e., sequence maps) against the optical ordered restriction map (i.e., the consensus map) are selected. The best matches are grouped into a set of triples called T_(S), and the cost v_((i,j)) and the position x_((i,j)) of each respective sequence s_(i) are analyzed in step 525. A set, S_(i), is selected from the simulated ordered restriction map S in step 530. The cost V of this set of triples T_(S) is then computed using, e.g., a specialized 1D Dynamic Programming Procedure (step 540), and compared to C. If V is equal to C plus or minus a tolerance value (step 550), then the set of triples T_(S) is determined to be the result of the alignment procedure (step 580). If V is not equal to C plus or minus a tolerance value, then first C is equated to V at step 560, and the triple [s_(i), x_((i,j′)), v_((i,j′))] corresponding to the best of the “second best” among the S_(i)'s is selected (step 570). The triple [s_(i), x_((i,j)), v_((i,j))] is then removed from the set of triples T_(S), and the triple [s_(i), x_((i,j′)), v_((i,j′))] (with j different from j′) is inserted into the set of triples T_(S) (step 575). A set S_(i) is again selected at step 530. A new V is then computed from the updated set of triples T_(S) (step 540).

Provided below is an exemplary map-based alignment algorithm/problem which can be utilized with the alignment process and system of the present invention. Let S=∪_(i)S_(i). For example, at most one triple from each S_(i) can be selected while satisfying the following global conditions/objectives which can possibly be relaxed:

1. When anchoring two or more selected triples within the alignment T_(S), two selected sequences s_(p) and s_(q) anchored at their respective x_((p,b)) and x_((q,a)) preferably do not overlap (for suitable p, q, a, and b and p not equal to q);

2. Σ(I_(i)×v_((i,j))) is minimized over each j in the sequences set S_(i) so that as many as possible sequence maps s_(i) are included in the alignment; and

3. the number of non-selected sequences, n−Σ_(i)I_(i), is minimized,

where I_(i) is an indicator variable assuming a value 1 if the triple from the sequence set S_(i) is included in the chosen set T_(S), and 0 otherwise.

It should be understood that the objectives (2) and (3) provided above may conflict. In particular, the minimum of the objective (2) is achieved when no sequence is selected, while with the objective (3), it is preferable to choose as many sequences as possible, irrespective of the score values. This conflict may be resolved by, e.g., a weighting scheme involving a Lagrangian-like term which linearly combines the two contradictory objectives.

It is possible to solve this problem by using various approximation algorithms. For example, the following two algorithms/procedures can be used:
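One way to write such a linear combination, given here only as an illustrative derivation with a user-chosen multiplier λ>0 and with v_i denoting the score of the triple selected from S_i, is:

$\min_{I}\; \lambda\sum_{i} I_{i}\,v_{i} + \Big(n - \sum_{i} I_{i}\Big) \;=\; n - \max_{I}\sum_{i} I_{i}\,\big(1 - \lambda\,v_{i}\big)$

Under this reading, minimizing the combined objective is the same as maximizing the sum of the per-sequence gains 1−λ·v_i over the selected, non-overlapping sequences, which is the form taken by the exemplary weight function W₂ given for the dynamic programming procedure. A small λ favors including many sequences, while a large λ favors low scores.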

1. a “Greedy” algorithm/procedure, and

2. a “Dynamic Programming” algorithm/procedure.

During the experimentation of the alignment system and process of the present invention, the Greedy algorithm/procedure and the Dynamic Programming algorithm/procedure were utilized with successful results. Provided below is a detailed description of these algorithms/procedures (1)-(2) of the present invention.

Greedy Algorithm/Procedure

A solution P can be constructed such that each S_(i) is ordered by value v_((i,j)). Then, the best item from each sequence S_(i) is placed in the partial solution P by selecting the sequences in the order imposed by each x_((i,j)). It should be understood that the final solution P is not guaranteed to be optimal; however, this solution may provide results which may be acceptable to the implementers of the alignment procedures.
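A minimal sketch of one reading of the greedy procedure is given below. It assumes that each candidate triple also carries the end position of the sequence map on the consensus map, so that overlaps can be tested; the `Placement` type, its field names and the rule of skipping a candidate that overlaps the previously accepted one are illustrative assumptions.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    seq_id: str   # sequence map identifier s_i
    start: float  # anchoring position x_(i,j) on the consensus map
    end: float    # assumed: start plus the span of the sequence map
    cost: float   # alignment score v_(i,j), lower is better


def greedy_alignment(candidates):
    """candidates maps each sequence id to its list of candidate placements."""
    # Keep only the best (lowest-cost) placement of every sequence.
    best = [min(ps, key=lambda p: p.cost) for ps in candidates.values() if ps]
    # Visit them in the order of their positions along the consensus map.
    best.sort(key=lambda p: p.start)
    solution = []
    for p in best:
        if not solution or p.start >= solution[-1].end:
            solution.append(p)  # accept only if it does not overlap
    return solution


if __name__ == "__main__":
    cands = {
        "a": [Placement("a", 0, 10, 5.0), Placement("a", 50, 60, 9.0)],
        "b": [Placement("b", 5, 15, 3.0)],
        "c": [Placement("c", 20, 30, 4.0)],
    }
    print([p.seq_id for p in greedy_alignment(cands)])
```

In this toy input the sequence "b" is dropped because it overlaps "a", although it has the lower cost, which illustrates why the greedy result is not guaranteed to be optimal.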

Dynamic Programming Algorithm/Procedure

This algorithm/procedure is based on the traditional dynamic programming approach. Indeed, the implementation of this algorithm/procedure is straightforward and space-efficient as provided below. The problem can first be considered for one exemplary case when k=1, and an appropriate algorithm can be selected. Next, the general case when k>1 can be considered, and good approximation heuristics may be devised.

(a) Alignment Procedure for Sequence Number k being 1. If the number of sequences k present in each set S_(i) of triples is restricted to be 1 (e.g., being the best score), then the problem yields to a feasible and efficient algorithm. In general, if the sequence matches uniquely to one map location, then this case should apply. An exemplary embodiment of the alignment algorithm for the dynamic programming solution, constructing the solution P, is described below. In particular,

1. Sort all the triples of sequence, position and cost, <s_(i), x_((i,j)), v_((i,j))>, in ascending x_((i,j)) order, and store the result in a list L. Thereafter, the indices i and j can be assumed to range over the list L.

2. Construct two vectors C[i] and B[i] (0<i≦n), where each entry in global cost C is defined to be the cost of including s_(i) in an alignment that already contains sequences, or a subset thereof, up to S_(j); and the index j is stored in B[i].

The update rules for C[i] and B[i] preferably search backward in the C vector for values which optimize the cost function, and set B to “point back” to the chosen point. For example,

C[i] = max (C[j] + W(λ; i)) such that S_(i) does not overlap with S_(j), 0<j<i,
B[i] = j.

The W(λ; i) function takes into consideration the conflicting nature of the objectives described above. Since it is most likely not possible to optimize both objectives simultaneously, a weight function can be generated (where a user may supply the parameter λ) which would preferably account for both objectives. Two exemplary W functions are provided below:

W₁(λ; i) = |S_(i)| − λ·v_(i),
W₂(λ; i) = 1 − λ·v_(i).

W₁ takes into account the “span” covered by the selected sequences (where |S_(i)| is the size of the sequence). W₂ takes into account the number of sequences which were selected. The parameter λ is controlled by the user.
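The k=1 procedure can be sketched as follows. This is a minimal illustration under stated assumptions: each triple is extended with the end position of the sequence on the consensus map so that overlap can be tested, an empty alignment with value 0 is used as the starting point (index 0), and the names `Placement` and `dp_alignment` are hypothetical.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    seq_id: str
    start: float  # anchoring position on the consensus map
    end: float    # assumed: start plus the span of the sequence map
    cost: float   # alignment score v_i, lower is better


def dp_alignment(placements, lam=0.1, weight="count"):
    """Select non-overlapping placements maximizing the sum of W(lam; i)."""
    L = sorted(placements, key=lambda p: p.start)
    n = len(L)

    def W(p):
        # weight="span" follows the first exemplary weight function,
        # anything else the second (number of selected sequences).
        gain = (p.end - p.start) if weight == "span" else 1.0
        return gain - lam * p.cost

    C = [0.0] * (n + 1)  # C[0] stands for the empty alignment
    B = [0] * (n + 1)    # back pointers
    for i in range(1, n + 1):
        best_j = 0
        for j in range(1, i):  # search backward for a compatible entry
            if L[j - 1].end <= L[i - 1].start and C[j] > C[best_j]:
                best_j = j
        C[i] = C[best_j] + W(L[i - 1])
        B[i] = best_j
    # Follow the back pointers from the best entry.
    i = max(range(n + 1), key=lambda t: C[t])
    chosen = []
    while i > 0:
        chosen.append(L[i - 1])
        i = B[i]
    return chosen[::-1]


if __name__ == "__main__":
    ps = [Placement("a", 0, 10, 5.0), Placement("b", 5, 15, 3.0),
          Placement("c", 20, 30, 4.0)]
    print([p.seq_id for p in dp_alignment(ps)])
```

The double loop makes each run quadratic in the number of triples, in line with the O(n²) work per iteration mentioned for the general case.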

(b) Alignment Procedure for Sequence Number k>1. If sequence number k>1, then the procedure may be more complex. Since for each set S_(i) there may be k alignments to select from, the complexity involved in a straightforward generalization of the preceding procedure is conjectured to grow exponentially. It is possible to use a heuristic procedure/algorithm to produce an acceptable solution in the case when the sequence number k>1. The concept of this procedure is to iterate or repeat the dynamic programming procedure (i.e., k=1 case) on an input set that takes the best possible solutions from each sequence S_(i) while ignoring the non-overlapping constraint. This solution can be further improved in the subsequent iteration by constructing a new input to the DPP procedure (i.e., where k=1) that consists of the preceding solution augmented with an element from each sequence S_(i) excluded in the preceding solution. Because the preceding solution is also a solution of the new expression, the new solution is at least as effective as the solution previously provided. In each iteration, the basic solution can also be a general (and possibly suboptimal) solution. Because an item, once removed from consideration, is never reconsidered, according to a preferred embodiment of the present invention there can be only O(kn) iterations, and each iteration involves O(n²) work. Hence a naive analysis yields an O(kn³) time algorithm.
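The iteration described above can be sketched as follows. This is one possible reading and not a definitive implementation: it assumes the gain 1−λ·v for each selected sequence, assumes that each candidate carries an end position for the overlap test, and uses hypothetical names (`Placement`, `solve_k1`, `iterative_alignment`).

```python
from typing import NamedTuple


class Placement(NamedTuple):
    seq_id: str
    start: float
    end: float   # assumed: start plus the span of the sequence map
    cost: float  # lower is better


def solve_k1(placements, lam):
    """k=1 dynamic program: best non-overlapping subset, gain 1 - lam*cost."""
    L = sorted(placements, key=lambda p: p.start)
    C, B = [0.0], [0]  # index 0 is the empty alignment
    for i, p in enumerate(L, start=1):
        best_j = 0
        for j in range(1, i):
            if L[j - 1].end <= p.start and C[j] > C[best_j]:
                best_j = j
        C.append(C[best_j] + 1.0 - lam * p.cost)
        B.append(best_j)
    i = max(range(len(C)), key=C.__getitem__)
    total, chosen = C[i], []
    while i:
        chosen.append(L[i - 1])
        i = B[i]
    return total, chosen[::-1]


def iterative_alignment(candidates, lam=0.1):
    """Repeat the k=1 program, each time offering the next-best candidate
    of every sequence that the preceding solution left out."""
    queues = {s: sorted(ps, key=lambda p: p.cost) for s, ps in candidates.items()}
    first = [q.pop(0) for q in queues.values() if q]
    score, solution = solve_k1(first, lam)
    while True:
        included = {p.seq_id for p in solution}
        extra = [q.pop(0) for s, q in queues.items() if s not in included and q]
        if not extra:  # no untried candidate remains for excluded sequences
            return score, solution
        # The preceding solution stays feasible, so the score cannot decrease.
        score, solution = solve_k1(solution + extra, lam)


if __name__ == "__main__":
    cands = {
        "a": [Placement("a", 0, 10, 1.0), Placement("a", 40, 50, 2.0)],
        "b": [Placement("b", 5, 15, 0.5), Placement("b", 20, 30, 3.0)],
    }
    print(iterative_alignment(cands))
```

Each pass removes at least one candidate from further consideration, so the loop terminates after at most as many passes as there are candidates.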

Experimental Results

FIG. 11 shows an illustration of a possible alignment of an exemplary chromosome arrangement using the system and method of the present invention. In particular, a region of the alignment of P. falciparum's Chromosome 12 is shown therein which was generated using the software implementing an exemplary embodiment of the validation, alignment and reordering system and method of the present invention. The two underlined maps in positions 39 and 50 of the figure illustrate an acceptable anchoring of “contigs” 11 and 13 to the optical ordered restriction map. Also, the alignment was obtained without any overlap filter.

One having ordinary skill in the art would clearly recognize many other applications of the embodiments of the system and process for validating and aligning the simulated ordered restriction maps according to the present invention. Indeed, the present invention is in no way limited to the exemplary applications and embodiments thereof described above.

1. A process for comparing ordered segments of a first DNA map with ordered segments of a second DNA map to determine a level of accuracy of the second DNA map with respect to the first DNA map, comprising the steps of: a) receiving in a processing device the first and second DNA maps, wherein the first DNA map is a sequence DNA map generated by cutting a DNA molecule using one or more restriction enzymes, and the second DNA map is a genomic consensus DNA map in an ordered restriction DNA map; and b) validating in the processing device the level of accuracy of the second DNA map with respect to the first DNA map based on information associated with the first and second DNA maps by comparing ordered segments of the first DNA map with ordered segments of the second DNA map using the following probability density function: Pr(D|Ĥ(σ,p_(c),p_(f))) where: D is the second DNA map, Ĥ is the first DNA map, σ is a standard deviation summarizing map-wide standard deviation data, p_(c) is a probability of a positive cut of a DNA sequence, and p_(f) is a probability of a false-positive cut of the DNA sequence, whereby a level of accuracy of the second DNA map with respect to the first DNA map is determined.
2. The process according to claim 1, wherein the validating step comprises determining whether one or more matches exist between the ordered segments of the first DNA map and the ordered segments of the second DNA map.
3. The process according to claim 2, wherein the validating step further comprises obtaining a number of the matches which exist between the ordered segments of the first DNA map and the ordered segments of the second DNA map after determining whether one or more matches exist between ordered segments of the first DNA map and the ordered segments of the second DNA map.
4. The process according to claim 3, wherein the validating step further comprises the substeps of: i. determining whether the first DNA map includes one or more cuts which are missing from the second DNA map, ii. after substep i, obtaining a first number and locations of the missing cuts based on the first and second DNA maps, iii. determining whether the second DNA map includes one or more cuts which are missing from the first DNA map, and iv. after substep iii, obtaining a second number and locations of the missing cuts based on the first and second DNA maps.
5. The process according to claim 4, further comprising the step of: c) generating an error indication if at least one of: i. the number of the matches is less than a match threshold, ii. the first number of the missing cuts is greater than a first predetermined threshold, and iii. the second number of the missing cuts is greater than a second predetermined threshold.
6. The process according to claim 1, wherein the validating step comprises determining whether the first DNA map includes one or more cuts which are missing from the second DNA map.
7. The process according to claim 6, wherein the validating step further comprises obtaining a number and locations of the missing cuts, after determining whether one or more matches exist between ordered segments of the first DNA map and the ordered segments of the second DNA map, based on the first and second DNA maps.
8. The process according to claim 1, wherein the validating step comprises determining whether the second DNA map includes one or more cuts which are missing from the first DNA map.
9. The process according to claim 8, wherein the validating step further comprises obtaining a number and locations of the missing cuts, after determining whether one or more matches exist between ordered segments of the first DNA map and the ordered segments of the second DNA map, based on the first and second DNA maps.
10. The process according to claim 1, wherein the first DNA map is an in-silico ordered restriction map obtained from a DNA sequence.
11. The process according to claim 10, wherein the first DNA map includes identification data and at least one vector of the segments of the first DNA map.
12. The process according to claim 11, wherein the at least one vector of the first segments encodes a size of base-pairs of the DNA sequence.
13. The process according to claim 12, wherein the second DNA map includes identification data and at least one variable-length vector representing its ordered segments.
14. The process according to claim 1, wherein the second DNA map is a subsequence of a genome-wide ordered restriction map of an optical DNA map.
15. The process according to claim 1, wherein the level of accuracy is validated as a function of an orientation of the first DNA map with respect to an orientation of the second DNA map.
16. The process according to claim 1, wherein the validation step comprises the substeps of: i. executing a dynamic programming procedure (“DPP”) on the first and second DNA maps to generate a first table of partial and complete alignment scores, and first auxiliary tables and first data structures to keep track of number and locations of cuts and segment matches, wherein the DPP comprises: assembling a N×M matching table T, wherein index “i” indicates a fragment in a consensus map having M fragments and index “j” indicates a fragment in a sequence map having N fragments, wherein each entry of the matching table T is computed by

$T[i,j] = \min_{\substack{0 < u \leq i \\ 0 < v \leq j}} \left( T[i-u,\,j-v] + \ln\left(\frac{\sqrt{2\pi\left(\sigma_{i}^{2} + \sigma_{i-1}^{2} + \ldots + \sigma_{i-v}^{2}\right)}}{p_{c}}\right) + \frac{\left(\left(l_{i} + l_{i-1} + \ldots + l_{i-v}\right) - \left(a_{j} + a_{j-1} + \ldots + a_{j-u}\right)\right)^{2}}{2\left(\sigma_{i}^{2} + \sigma_{i-1}^{2} + \ldots + \sigma_{i-v}^{2}\right)} + (u-1)\ln\frac{1}{1-p_{c}} + (v-1)\ln\frac{1}{p_{f}} \right)$

wherein “u” is a given cut in the sequence map, “v” is a given cut in the consensus map, “p_(c)” is the probability of a positive cut in the sequence map, “p_(f)” is the probability of a false-positive cut in the sequence map, “l” is the fragment length, and “σ_(i)” is the estimated standard deviation of fragment sizes, ii. receiving a third DNA map which is a reverse map of the first DNA map, iii. executing the DPP on the second and third DNA maps to generate a second table of partial and complete alignment scores, and second auxiliary tables and second data structures to keep track of number and locations of the cuts and the segment matches, and iv. analyzing a last row of the first table and a last row of the second table to obtain at least one optimum alignment of the first and second DNA maps, and v. constructing at least one of the optimum alignment and suboptimal alignments using the first and second auxiliary tables and data structures.
17. The process according to claim 1, wherein the level of accuracy is validated by matching an extension of a first left end segment of the ordered segments of the first DNA map to at least one of the ordered segments of the second DNA map.
18. The process according to claim 1, wherein the level of accuracy is validated by matching an extension of a first right end segment of the ordered segments of the first DNA map to at least one of the ordered segments of the second DNA map.
19. The process according to claim 1, further comprising the step of: c) detecting an alignment of the first DNA map with respect to the second DNA map, the alignment being indicative of sequence positions of the ordered segments of the first DNA map along the second DNA map.
20. A software system which, when executed on a processing device, configures the processing device to compare segments of a first DNA map with segments of a second DNA map to determine a level of accuracy of the second DNA map with respect to the first DNA map, the software system comprising: a processing device; a processing subsystem stored in the processing device and which, when executed on the processing device, configures the processing device to perform the following steps: a) receives the first and second DNA maps, wherein the first DNA map is a sequence DNA map generated by cutting a DNA molecule using one or more restriction enzymes, and the second DNA map is a genomic consensus DNA map in an ordered restriction DNA map, b) validates the level of accuracy of the second DNA map with respect to the first DNA map based on information associated with the first and second DNA maps by comparing ordered segments of the first DNA map with ordered segments of the second DNA map using the following probability density function: Pr(D|Ĥ(σ,p_(c),p_(f))) where: D is the second DNA map, Ĥ is the first DNA map, σ is a standard deviation summarizing map-wide standard deviation data, p_(c) is a probability of a positive cut of a DNA sequence, and p_(f) is a probability of a false-positive cut of the DNA sequence, whereby a level of accuracy of the second DNA map with respect to the first DNA map is determined, and c) outputs the level of accuracy to a user.
21. The software system according to claim 20, wherein, when validating the level of accuracy, the processing subsystem determines whether one or more matches exists between at least one of the segments of the first DNA map and at least one of the segments of the second DNA map.
22. The software system according to claim 21, wherein, when validating the level of accuracy, the processing subsystem obtains a number of the matches which exist between the segments of the first DNA map and the segments of the second DNA map after determining whether one or more matches exist between the ordered segments of the first DNA map and the ordered segments of the second DNA map.
23. The software system according to claim 22, wherein, when validating the level of accuracy, the processing subsystem: i. determines whether the first DNA map includes one or more cuts which are missing from the second DNA map, ii. obtains number and location of the missing cuts based on the first and second DNA maps, iii. determines whether the second DNA map includes one or more cuts which are missing from the first DNA map, and iv. obtains a second number of the missing cuts based on the first and second DNA maps.
24. The software system according to claim 23, wherein, when executed on the processing device, the processing subsystem further configures the processing device to generate an error indication if at least one of: i. the number of the matches is less than a match threshold, ii. the first number of the missing cuts is greater than a first predetermined threshold, and iii. the second number of the missing cuts is greater than a second predetermined threshold.
25. The software system according to claim 20, wherein, when validating the level of accuracy, the processing subsystem determines whether the first DNA map includes one or more cuts which are missing from the second DNA map.
26. The software system according to claim 25, wherein, when validating the level of accuracy, the processing subsystem obtains number and location of the missing cuts based on the first and second DNA maps.
27. The software system according to claim 20, wherein, when validating the level of accuracy, the processing subsystem obtains number and location of the missing cuts, after determining whether one or more matches exist between ordered segments of the first DNA map and the ordered segments of the second DNA map, based on the first and second DNA maps.
28. The software system according to claim 20, wherein, when validating the level of accuracy, the processing subsystem determines whether the second DNA map includes one or more cuts which are missing from the first DNA map.
29. The software system according to claim 20, wherein the first DNA map is an in-silico ordered restriction map obtained from a DNA sequence.
30. The software system according to claim 29, wherein the first DNA map includes identification data and a variable-length vector of the segments of the first DNA map.
31. The software system according to claim 30, wherein the vector of the segments of the first DNA map encodes a size of base pairs of the DNA sequence.
32. The software system according to claim 31, wherein the second DNA map includes identification data and a variable length vector of the segments of the second DNA map.
33. The software system according to claim 20, wherein the second DNA map is a genome-wide ordered restriction map of an optical DNA map.
34. The software system according to claim 20, wherein the level of accuracy is validated as a function of an orientation of the first DNA map with respect to an orientation of the second DNA map.
35. The software system according to claim 20, wherein, when validating the level of accuracy, the processing subsystem: i. executes a dynamic programming procedure (“DPP”) on the first and second DNA maps to generate a first table of partial and complete alignment scores, and first auxiliary tables and data structures to keep track of number and locations of cuts and segment matches, wherein the DPP comprises: assembling a N×M matching table T, wherein index “i” indicates a fragment in a consensus map having M fragments and index “j” indicates a fragment in a sequence map having N fragments, wherein each entry of the matching table T is computed by

$T[i,j] = \min_{\substack{0 < u \leq i \\ 0 < v \leq j}} \left( T[i-u,\,j-v] + \ln\left(\frac{\sqrt{2\pi\left(\sigma_{i}^{2} + \sigma_{i-1}^{2} + \ldots + \sigma_{i-v}^{2}\right)}}{p_{c}}\right) + \frac{\left(\left(l_{i} + l_{i-1} + \ldots + l_{i-v}\right) - \left(a_{j} + a_{j-1} + \ldots + a_{j-u}\right)\right)^{2}}{2\left(\sigma_{i}^{2} + \sigma_{i-1}^{2} + \ldots + \sigma_{i-v}^{2}\right)} + (u-1)\ln\frac{1}{1-p_{c}} + (v-1)\ln\frac{1}{p_{f}} \right)$

wherein “u” is a given cut in the sequence map, “v” is a given cut in the consensus map, “p_(c)” is the probability of a positive cut in the sequence map, “p_(f)” is the probability of a false-positive cut in the sequence map, “l” is the fragment length, and “σ_(i)” is the estimated standard deviation of fragment sizes, ii. receives a third DNA map which is a reverse map of the first DNA map, iii. executes the DPP on the second and third DNA maps to generate a second table of partial and complete alignment scores, and second auxiliary tables and data structures to keep track of number and locations of cuts and segment matches, iv. analyzes a last row of the first table and a last row of the second table to obtain at least one optimum alignment of the first and second DNA maps, and v. constructing at least one of the optimum alignment and suboptimal alignments using the first and second auxiliary tables and data structures.
36. The software system according to claim 20, wherein the level of accuracy is validated by matching an extension of a first left end segment of the segments of the first DNA map to at least one of the segments of the second DNA map.
37. The software system according to claim 20, wherein the level of accuracy is validated by matching an extension of a first right end segment of the first DNA map to at least one of the segments of the second DNA map.
38. The software system according to claim 20, wherein, when executed on the processing device, the processing subsystem further configures the processing device to determine an alignment of the first DNA map with respect to the second DNA map, the alignment being indicative of sequence positions of the first segments along the second DNA map.